In general I suggest using a shell script and running the commands
manually instead of using the crawl command, maybe something like:
NUTCH_HOME=$HOME/nutch-0.8-dev
while [ 1 ]
# or maybe just 10 rounds
do
  DATE=$(date +%d%B%Y_%H%M%S)
  $NUTCH_HOME/bin/nutch generate /user/nutchUser/crawldb /user/nutchUser/segments -topN 5000000
  s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 1-38`
  $NUTCH_HOME/bin/nutch fetch $s
  $NUTCH_HOME/bin/nutch updatedb /user/nutchUser/crawldb $s
  # only when indexing:
  # $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/segments
  # what to index, maybe the merged segment from the 10 rounds:
  # s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 24-38`
  # index:
  # $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$s /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/segments/$s
done
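As a runnable aside, the tail/cut trick the script uses to pick the newest segment can be sketched as below. The listing and the column widths are made up for illustration; the real `ndfs -ls` output layout (and therefore the 1-38 and 24-38 offsets in the script above) may differ on your installation.

```shell
# Made-up example of what `nutch ndfs -ls /user/nutchUser/segments`
# might print: path first, one entry per line, newest last. The exact
# NDFS listing format (and thus the cut ranges) is an assumption here.
listing='/user/nutchUser/segments/20051219000754  <dir>
/user/nutchUser/segments/20051220113005  <dir>'

# Newest segment is on the last line; in this sample layout the full
# path occupies the first 39 characters.
newest=$(printf '%s\n' "$listing" | tail -1 | cut -c 1-39)

# The bare segment name (the timestamp) sits in columns 26-39 here.
segname=$(printf '%s\n' "$listing" | tail -1 | cut -c 26-39)

echo "$newest"    # /user/nutchUser/segments/20051220113005
echo "$segname"   # 20051220113005
```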
This prevents you from having to merge crawl db's.
Then you only need the merged segment, the linkdb, and the index built
from the merged segment.
The 10 segments used to build the merged segment can be removed.
Hope this helps. Note that you may still want to change the script to
a 10-round loop to create your 10 segments, and the merge command is
also not in the script.
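For what it's worth, a fixed 10-round version with the merge step added might look roughly like the sketch below. The paths, the `mergesegs` argument order, and the per-round segment names are assumptions, and the `nutch` shell function is just a stub that records and echoes each call so the sketch runs without a Nutch install; replace it with $NUTCH_HOME/bin/nutch for real use.

```shell
# Stub standing in for $NUTCH_HOME/bin/nutch so this sketch is
# runnable anywhere; it only records and echoes the call.
nutch() {
  CALLS="$CALLS
nutch $*"
  echo "nutch $*"
}

CRAWLDB=/user/nutchUser/crawldb
SEGMENTS=/user/nutchUser/segments

for round in 1 2 3 4 5 6 7 8 9 10; do
  nutch generate "$CRAWLDB" "$SEGMENTS" -topN 5000000
  # In the real script the newest segment comes from `nutch ndfs -ls`;
  # here a hypothetical name keeps the stub simple.
  s="$SEGMENTS/seg$round"
  nutch fetch "$s"
  nutch updatedb "$CRAWLDB" "$s"
done

# After the 10 rounds: merge the segments, then build the linkdb
# (the mergesegs output/input argument order is an assumption).
nutch mergesegs "$SEGMENTS/merged" -dir "$SEGMENTS"
nutch invertlinks /user/nutchUser/linkdb "$SEGMENTS"
```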
Stefan
On 21.12.2005, at 18:28, Bryan Woliner wrote:
I am using nutch 0.7.1 (non-mapred) and am a little confused about
how to move the contents of several "test" crawls into a single
"live" directory.
Any suggestions are very much appreciated!
I want to have a "Live" directory that contains all the indexes that
are ready to be searched.
The first index I want to add to the "Live" directory comes from a
crawl with 10 rounds of fetching, whose db and segments are stored in
the following directories:
/crawlA/db/
/crawlA/segments/
I can merge all of the segments in the segments directory (using
bin/nutch mergesegs), which results in the following (11th) segment
directory:
/crawlA/segments/20051219000754/
I can then index this 11th (i.e. merged) segment.
However, I have the following questions about which files and
directories should be moved to the "Live" directory:
1. If I copy /crawlA/db/ to /Live/db/ and copy
/crawlA/segments/20051219000754/ to /Live/segments/20051219000754/ ,
then I can start tomcat from /Live/ and I'm able to search the index
fine. However, I'm not sure if that can be duplicated for my crawlB
directory. I can't copy /crawlB/db/
to the "Live" directory because there is already a db directory there.
What are the correct files and directories to copy from each crawl
into the "Live" directory?
2. On a side note: am I even taking the correct approach in merging
the 10 segments in the crawlA/segments/ directory before I index, or
should I index each segment first and then merge the 10 indexes? If I
were to take the latter approach (merging indexes instead of
segments), which files from the /crawlA/ directory would I need to
move to the "Live" directory?
Thanks ahead of time for any helpful suggestions,
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net