Kai_testing Middleton wrote:
I've been reviewing the four different merge commands (as of nutch v0.9):

$ nutch | grep merg
  mergedb           merge crawldb-s, with optional filtering
  mergesegs         merge several segments, with optional filtering and slicing
  mergelinkdb       merge linkdb-s, with optional filtering
  merge             merge several segment indexes

Here are the javadocs:
mergedb -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
mergesegs -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
mergelinkdb -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
merge -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html

Naively: why are there four merge commands? Are some subsets of the others?  
Are they used in conjunction? What are the usage scenarios of each?

I notice that Andrzej wrote the first three, and they have wiki entries (pretty 
much the same as the javadoc):
(I found these from http://www.mail-archive.com/[EMAIL PROTECTED]/msg03588.html)
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
It seems most of the nutch-user discussions I've seen so far relate to the simple merge command. Are the first three "advanced commands"?

They serve different purposes. Let's assume that somehow you've got two crawldb-s, e.g. you ran two crawls with different seed lists and different filters. Now you want to take these collections of urls and create one big crawl. Then you would use mergedb to merge the crawldb-s, mergelinkdb to merge the linkdb-s, and mergesegs to merge the segments ;)

And a simple "merge" merges indexes of multiple segments, which is a performance-related step in the regular Nutch work-cycle.
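
As a rough sketch of the scenario above (not a tested recipe): the directory names crawlA, crawlB and merged, and the segment names, are made up for illustration, and the argument order follows my reading of the usage strings of the 0.9 tools, so check the output of "bin/nutch <command>" on your installation first. The script only prints the commands unless NUTCH is set to the real launcher.

```shell
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set NUTCH=bin/nutch in the environment to run them for real.
NUTCH="${NUTCH:-echo bin/nutch}"

merge_crawls() {
  # Hypothetical layout: two independent crawls and one output directory.
  a=crawlA
  b=crawlB
  out=merged

  # One-off consolidation of two crawls: output path first, then inputs.
  $NUTCH mergedb     "$out/crawldb" "$a/crawldb" "$b/crawldb"
  $NUTCH mergelinkdb "$out/linkdb"  "$a/linkdb"  "$b/linkdb"

  # Segment names below are invented; list your real segment directories.
  $NUTCH mergesegs "$out/segments" \
      "$a/segments/20070601000000" "$b/segments/20070602000000"

  # Regular work-cycle step: merge the per-segment indexes into one index.
  # Assumes the merged segments were already indexed into $out/indexes.
  $NUTCH merge "$out/index" "$out/indexes"
}

merge_crawls
```

The first three commands are the ones you need only when combining separate crawls; the last one is the same "merge" you would run at the end of any ordinary crawl cycle.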

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
