I am using Windows XP Pro. Can you let me know how I can achieve the same in
a Windows environment?
That is easy: install Cygwin, then your M$ OS also works like a Unix. :-D
What steps do I need to follow so that I neither overwrite nor merge two
webdbs together?
I'm not sure I understand your question correctly, but you can store different webdbs in different folders.
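For example, something like this (a minimal sketch using 0.7-style commands; the folder names and URL files are made up):

    bin/nutch admin crawlA/db -create
    bin/nutch admin crawlB/db -create
    bin/nutch inject crawlA/db -urlfile urlsA.txt
    bin/nutch inject crawlB/db -urlfile urlsB.txt

Each db folder is a self-contained webdb, so the two never touch each other.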

Instead, the webdb should just be updated as and when new lists of URLs are added.
Update the webdb you wish to get new URLs into with the freshly fetched segments.
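For example, something like this (0.7-style syntax; the segment name is just an example):

    bin/nutch updatedb db segments/20051219000754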
Does that answer the question?
Stefan


On 12/22/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

In general I suggest using a shell script and running the commands
manually instead of using the crawl command, maybe something like:

NUTCH_HOME=$HOME/nutch-0.8-dev

# loop forever -- or maybe just 10 rounds
while [ 1 ]
do
    # timestamp for this round (not used below)
    DATE=$(date +%d%B%Y_%H%M%S)

    # generate a fetchlist of at most 5,000,000 URLs in a new segment
    $NUTCH_HOME/bin/nutch generate /user/nutchUser/crawldb /user/nutchUser/segments -topN 5000000

    # pick up the full path of the newest segment (characters 1-38 of the ls line)
    s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 1-38`

    # fetch the segment and update the crawldb with the results
    $NUTCH_HOME/bin/nutch fetch $s
    $NUTCH_HOME/bin/nutch updatedb /user/nutchUser/crawldb $s

    # only when indexing:
    # $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/segments

    # what to index -- maybe the merged segment from the 10 rounds
    # (characters 24-38 are the segment name without the parent path)
    # s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 24-38`

    # index:
    # $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$s /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/segments/$s
done

This prevents you from having to merge crawldbs.
Then you only need the merged segment, the linkdb, and the index built
from the merged segment.
The 10 segments used to build the merged segment can be removed.
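For example, something like this (assuming the ndfs shell in your build has an -rm option; the segment name is a placeholder):

    $NUTCH_HOME/bin/nutch ndfs -rm /user/nutchUser/segments/<old_segment>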

Hope this helps. You may only need to change the script to run a 10-round
loop to create your 10 segments; the merging command is also not in the
script.
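A sketch of the merge step, assuming the mergesegs tool with flags as in 0.7's SegmentMergeTool (run bin/nutch mergesegs with no arguments to check the exact usage on your version; the output path is made up):

    $NUTCH_HOME/bin/nutch mergesegs -dir /user/nutchUser/segments -o /user/nutchUser/segments_merged

Then index the merged segment with the commented-out index command above.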
Stefan

On 21.12.2005, at 18:28, Bryan Woliner wrote:

I am using nutch 0.7.1 (non-mapred) and am a little confused about how to
move the contents of several "test" crawls into a single "live" directory.
Any suggestions are very much appreciated!

I want to have a "Live" directory that contains all the indexes that
are ready to be searched.

The first index I want to add to the "Live" directory comes from a
crawl with 10 rounds of fetching, whose db and segments are stored in
the following directories:

/crawlA/db/
/crawlA/segments/

I can merge all of the segments in the segments directory (using
bin/nutch mergesegs), which results in the following (11th) segment
directory:

/crawlA/segments/20051219000754/

I can then index this 11th (i.e. merged) segment.
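For concreteness, the indexing step I run is something like this (assuming I have the 0.7 syntax right; as I understand it, the index ends up inside the segment directory):

    bin/nutch index crawlA/segments/20051219000754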

However, I have the following questions about which files and
directories should be moved to the "Live" directory:

1. If I copy /crawlA/db/ to /Live/db/ and copy
/crawlA/segments/20051219000754/ to /Live/segments/20051219000754/,
then I can start tomcat from /Live/ and I'm able to search the index
fine. However, I'm not sure that can be duplicated for my crawlB
directory: I can't copy /crawlB/db/ to the "Live" directory because
there is already a db directory there. What are the correct files and
directories to copy from each crawl into the "Live" directory?

2. On a side note: am I even taking the correct approach in merging the 10
segments in the crawlA/segments/ directory before I index, or should I
index each segment first and then merge the 10 indexes? If I were to take
the latter approach (merging indexes instead of segments), which files
from the /crawlA/ directory would I need to move to the "Live" directory?
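(If the latter approach is the right one, I assume the index merge itself would look roughly like the following; bin/nutch merge invokes the IndexMerger, but I haven't verified the argument order:

    bin/nutch merge /Live/index crawlA/segments/*/index
)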

Thanks ahead of time for any helpful suggestions,

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net






