In general I suggest using a shell script and running the commands
manually instead of using the crawl command, maybe something like:
NUTCH_HOME=$HOME/nutch-0.8-dev
while [ 1 ]
# or maybe just 10 rounds
do
  DATE=$(date +%d%B%Y_%H%M%S)
  $NUTCH_HOME/bin/nutch generate /user/nutchUser/crawldb /user/nutchUser/segments -topN 5000000
  s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 1-38`
  $NUTCH_HOME/bin/nutch fetch $s
  $NUTCH_HOME/bin/nutch updatedb /user/nutchUser/crawldb $s
  # only when indexing:
  # $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/segments
  # what to index, maybe the merged segment from the 10 rounds:
  # s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 24-38`
  # index:
  # $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$s /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/segments/$s
done
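As a runnable aside, the tail/cut trick the script uses to pick the newest segment can be sketched as below. The listing and the column widths are made up for illustration; the real `ndfs -ls` output layout (and therefore the 1-38 and 24-38 offsets in the script above) may differ on your installation.

```shell
# Made-up example of what `nutch ndfs -ls /user/nutchUser/segments`
# might print: path first, one entry per line, newest last. The exact
# NDFS listing format (and thus the cut ranges) is an assumption here.
listing='/user/nutchUser/segments/20051219000754  <dir>
/user/nutchUser/segments/20051220113005  <dir>'

# Newest segment is on the last line; in this sample layout the full
# path occupies the first 39 characters.
newest=$(printf '%s\n' "$listing" | tail -1 | cut -c 1-39)

# The bare segment name (the timestamp) sits in columns 26-39 here.
segname=$(printf '%s\n' "$listing" | tail -1 | cut -c 26-39)

echo "$newest"    # /user/nutchUser/segments/20051220113005
echo "$segname"   # 20051220113005
```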
This prevents you from having to merge crawl db's.
Then you only need the merged segment, the linkdb, and the index built
from the merged segment.
The 10 segments used to build the merged segment can be removed.
Hope this helps. Note that you may still want to change the script to
a 10-round loop to create your 10 segments, and the merge command is
also not in the script.
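For what it's worth, a fixed 10-round version with the merge step added might look roughly like the sketch below. The paths, the `mergesegs` argument order, and the per-round segment names are assumptions, and the `nutch` shell function is just a stub that records and echoes each call so the sketch runs without a Nutch install; replace it with $NUTCH_HOME/bin/nutch for real use.

```shell
# Stub standing in for $NUTCH_HOME/bin/nutch so this sketch is
# runnable anywhere; it only records and echoes the call.
nutch() {
  CALLS="$CALLS
nutch $*"
  echo "nutch $*"
}

CRAWLDB=/user/nutchUser/crawldb
SEGMENTS=/user/nutchUser/segments

for round in 1 2 3 4 5 6 7 8 9 10; do
  nutch generate "$CRAWLDB" "$SEGMENTS" -topN 5000000
  # In the real script the newest segment comes from `nutch ndfs -ls`;
  # here a hypothetical name keeps the stub simple.
  s="$SEGMENTS/seg$round"
  nutch fetch "$s"
  nutch updatedb "$CRAWLDB" "$s"
done

# After the 10 rounds: merge the segments, then build the linkdb
# (the mergesegs output/input argument order is an assumption).
nutch mergesegs "$SEGMENTS/merged" -dir "$SEGMENTS"
nutch invertlinks /user/nutchUser/linkdb "$SEGMENTS"
```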
Stefan
On 21.12.2005, at 18:28, Bryan Woliner wrote:
I am using nutch 0.7.1 (non-mapred) and am a little confused about
how to move the contents of several "test" crawls into a single
"live" directory.
Any suggestions are very much appreciated!
I want to have a "Live" directory that contains all the indexes that
are ready to be searched.
The first index I want to add to the "Live" directory comes from a
crawl with 10 rounds of fetching, whose db and segments are stored in
the following directories:
/crawlA/db/
/crawlA/segments/
I can merge all of the segments in the segments directory (using
bin/nutch mergesegs), which results in the following (11th) segment
directory:
/crawlA/segments/20051219000754/
I can then index this 11th (i.e. merged) segment.
However, I have the following questions about which files and
directories should be moved to the "Live" directory:
1. If I copy /crawlA/db/ to /Live/db/ and copy
/crawlA/segments/20051219000754/ to /Live/segments/20051219000754/ ,
then I can start tomcat from /Live/ and I'm able to search the index
fine. However, I'm not sure if that can be duplicated for my crawlB
directory. I can't copy /crawlB/db/
to the "Live" directory because there is already a db directory there.
What are the correct files and directories to copy from each crawl
into the "Live" directory?
2. On a side note: am I even taking the correct approach in merging
the 10 segments in the crawlA/segments/ directory before I index, or
should I index each segment first and then merge the 10 indexes? If I
were to take the latter approach (merging indexes instead of
segments), which files from the /crawlA/ directory would I need to
move to the "Live" directory?
Thanks ahead of time for any helpful suggestions,
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net