Stefan, I am using Windows XP Pro. Can you let me know how I can achieve the same in a Windows environment? What steps do I need to follow so that I neither overwrite nor merge two web dbs together, but instead have the web db updated as and when new lists of URLs are added?
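For example, since bin/nutch is a plain Unix shell script, I assume I would have to run your loop below under something like Cygwin on XP. Would saving it to a file and running it from the Cygwin bash prompt be the right approach? (The file name crawl-loop.sh and the install path are just my guesses:)

    # assumption: Cygwin is installed and nutch is unpacked in C:\nutch-0.8-dev
    cd /cygdrive/c/nutch-0.8-dev
    # run the generate/fetch/updatedb loop quoted below, saved to a file
    sh crawl-loop.sh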
On 12/22/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> In general I suggest using a shell script and running the commands
> manually instead of using the crawl command, maybe something like:
>
> NUTCH_HOME=$HOME/nutch-0.8-dev
>
> while [ 1 ]   # or maybe just 10 rounds
> do
>   DATE=$(date +%d%B%Y_%H%M%S)
>
>   $NUTCH_HOME/bin/nutch generate /user/nutchUser/crawldb /user/nutchUser/segments -topN 5000000
>   # grab the full path of the newest segment just generated
>   s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 1-38`
>   $NUTCH_HOME/bin/nutch fetch $s
>   $NUTCH_HOME/bin/nutch updatedb /user/nutchUser/crawldb $s
>
>   # only when indexing:
>   # $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/segments
>
>   # what to index -- maybe the merged segment from the 10 rounds
>   # (cut -c 24-38 keeps only the segment name, not the full path):
>   # s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 24-38`
>   # $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$s /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/segments/$s
> done
>
> This prevents you from having to merge crawl db's.
> Then you only need the merged segment, the linkdb and the index built
> from the merged segment.
> The 10 segments used to build the merged segment can be removed.
>
> Hope this helps. You may only need to change the script so it loops
> just 10 times to create your 10 segments; note that the merging
> command is also not in the script.
> Stefan
>
> On 21.12.2005 at 18:28, Bryan Woliner wrote:
>
> > I am using nutch 0.7.1 (non-mapred) and am a little confused about
> > how to move the contents of several "test" crawls into a single
> > "live" directory. Any suggestions are very much appreciated!
> >
> > I want to have a "Live" directory that contains all the indexes
> > that are ready to be searched.
> >
> > The first index I want to add to the "Live" directory comes from a
> > crawl with 10 rounds of fetching, whose db and segments are stored
> > in the following directories:
> >
> > /crawlA/db/
> > /crawlA/segments/
> >
> > I can merge all of the segments in the segments directory (using
> > bin/nutch mergesegs), which results in the following (11th) segment
> > directory:
> >
> > /crawlA/segments/20051219000754/
> >
> > I can then index this 11th (i.e. merged) segment.
> >
> > However, I have the following questions about which files and
> > directories should be moved to the "Live" directory:
> >
> > 1. If I copy /crawlA/db/ to /Live/db/ and copy
> > /crawlA/segments/20051219000754/ to /Live/segments/20051219000754/,
> > then I can start tomcat from /Live/ and I'm able to search the
> > index fine. However, I'm not sure if that can be duplicated for my
> > crawlB directory. I can't copy /crawlB/db/ to the "Live" directory
> > because there is already a db directory there. What are the correct
> > files and directories to copy from each crawl into the "Live"
> > directory?
> >
> > 2. On a side note: am I even taking the correct approach in merging
> > the 10 segments in the crawlA/segments/ directory before I index,
> > or should I index each segment first and then merge the 10 indexes?
> > If I were to take the latter approach (merging indexes instead of
> > segments), which files from the /crawlA/ directory would I need to
> > move to the "Live" directory?
> >
> > Thanks ahead of time for any helpful suggestions,
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
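P.S. For reference, here is my current understanding of the merge-and-index step that is not in the script above. The invertlinks and index calls just mirror Stefan's 0.8-dev examples; the mergesegs arguments and the "merged" directory name are only my guesses, so please correct me if the syntax is different:

    NUTCH_HOME=$HOME/nutch-0.8-dev
    DATE=$(date +%d%B%Y_%H%M%S)

    # merge the segments from the 10 rounds into a single new segment
    # (argument order is my assumption -- run bin/nutch mergesegs to see the real usage)
    $NUTCH_HOME/bin/nutch mergesegs /user/nutchUser/merged/$DATE -dir /user/nutchUser/segments

    # build the linkdb and the index from the merged segment only
    $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/merged
    $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$DATE /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/merged/$DATE

    # after that the 10 per-round segments can be removed, as Stefan notes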
