Kai_testing Middleton wrote:
I've been reviewing the four different merge commands (as of nutch v0.9):
$ nutch | grep merg
  mergedb      merge crawldb-s, with optional filtering
  mergesegs    merge several segments, with optional filtering and slicing
  mergelinkdb  merge linkdb-s, with optional filtering
  merge        merge several segment indexes
Here are the javadocs:
mergedb --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
mergesegs --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
mergelinkdb --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
merge --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html
Naively: why are there four merge commands? Are some subsets of the others?
Are they used in conjunction? What are the usage scenarios of each?
I notice that Andrzej wrote the first three, and they have wiki entries (pretty
much the same as the javadoc):
(I found these from http://www.mail-archive.com/[EMAIL PROTECTED]/msg03588.html)
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
It seems most of the nutch-user discussions I've seen so far relate to the simple merge command. Are the first three "advanced commands"?
They serve different purposes. Let's assume that you've somehow ended up
with two crawldb-s, e.g. you ran two crawls with different seed lists and
different filters. Now you want to take these collections of URLs and
create one big crawl. Then you would use mergedb to merge the crawldb-s,
mergelinkdb to merge the linkdb-s, and mergesegs to merge the segments ;)
The simple "merge", by contrast, merges the indexes of multiple segments,
which is a performance-related step in the regular Nutch work cycle.
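
As a rough sketch of the two-crawl scenario, assuming two hypothetical crawl
directories that each contain crawldb, linkdb and segments subdirectories,
the three merges might be scripted as follows. The directory names, the
merge_crawls helper and the NUTCH variable are illustrative assumptions. The
argument order (output first, then inputs) follows the tools' usage messages
as of 0.9 as best I recall them, so run each command without arguments to
confirm the usage for your version. By default the script only prints the
commands rather than running them.

```shell
#!/bin/sh
# Dry-run sketch: by default this only prints the merge commands.
# Set NUTCH=bin/nutch in the environment to run them for real.
NUTCH="${NUTCH:-echo bin/nutch}"

# merge_crawls CRAWL_A CRAWL_B MERGED
# Hypothetical helper; all three directory arguments are illustrative.
merge_crawls() {
    crawl_a="$1"
    crawl_b="$2"
    merged="$3"

    # Combine the two crawldb-s into a single crawldb.
    $NUTCH mergedb "$merged/crawldb" "$crawl_a/crawldb" "$crawl_b/crawldb"

    # Combine the two linkdb-s into a single linkdb.
    $NUTCH mergelinkdb "$merged/linkdb" "$crawl_a/linkdb" "$crawl_b/linkdb"

    # Combine every segment found under each crawl's segments directory.
    $NUTCH mergesegs "$merged/segments" \
        -dir "$crawl_a/segments" -dir "$crawl_b/segments"
}
```

Calling merge_crawls crawlA crawlB merged prints the three command lines.
Since the simple "merge" works on indexes, the merged segments would still
have to be indexed before that step comes into play.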
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com