Kai_testing Middleton wrote:
> I've been reviewing the four different merge commands (as of nutch v0.9):
>
> $ nutch | grep merg
>   mergedb       merge crawldb-s, with optional filtering
>   mergesegs     merge several segments, with optional filtering and slicing
>   mergelinkdb   merge linkdb-s, with optional filtering
>   merge         merge several segment indexes
>
> Here are the javadocs:
> mergedb --
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
> mergesegs --
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
> mergelinkdb --
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
> merge --
> http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html
>
> Naively: why are there four merge commands? Are some subsets of the others?
> Are they used in conjunction? What are the usage scenarios of each?
>
> I notice that Andrzej wrote the first three, and they have wiki entries
> (pretty much the same as the javadoc):
> (I found these from http://www.mail-archive.com/[EMAIL PROTECTED]/msg03588.html)
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
> http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
> It seems most of the nutch-user discussions I've seen so far relate to the
> simple merge command. Are the first three "advanced commands"?
>
They serve different purposes. Let's assume that you have somehow ended up
with two crawldb-s, e.g. you ran two crawls with different seed lists and
different filters. Now you want to take these collections of URLs and create
one big crawl. Then you would use mergedb to merge the crawldb-s, mergelinkdb
to merge the linkdb-s, and mergesegs to merge the segments ;)

The simple "merge" command, by contrast, merges the indexes of multiple
segments, which is a performance-related step in the regular Nutch work cycle.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
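
For the two-crawl scenario in the reply above, a rough sketch of the command sequence could look like the following. The directory names (crawlA, crawlB, merged) and the crawldb/linkdb/segments/indexes layout under each of them are assumptions made for illustration, and the argument order follows the usage strings of the 0.9 tools as I remember them; running each command with no arguments prints its actual usage. The script only prints the commands unless RUN is set to an empty string.

```shell
#!/bin/sh
# Dry-run sketch: by default every command is printed, not executed.
# Set RUN="" (empty) to really run them, assuming bin/nutch exists
# relative to the current directory.
RUN="${RUN-echo}"

# merge_crawls A B OUT: combine two crawl directories into OUT.
# Assumed (hypothetical) layout: each crawl dir holds crawldb, linkdb, segments.
merge_crawls() {
    src_a=$1
    src_b=$2
    dest=$3
    # One crawldb out of two.
    $RUN bin/nutch mergedb "$dest/crawldb" "$src_a/crawldb" "$src_b/crawldb"
    # One linkdb out of two.
    $RUN bin/nutch mergelinkdb "$dest/linkdb" "$src_a/linkdb" "$src_b/linkdb"
    # -dir names a directory that contains segments; repeating it
    # is assumed to be accepted by mergesegs.
    $RUN bin/nutch mergesegs "$dest/segments" -dir "$src_a/segments" -dir "$src_b/segments"
}

# merge_indexes DIR: the plain "merge" step. Combines per-segment indexes,
# assumed to have been built already under DIR/indexes, into DIR/index.
merge_indexes() {
    $RUN bin/nutch merge "$1/index" "$1/indexes"
}

merge_crawls crawlA crawlB merged
merge_indexes merged
```

The first three commands are only needed when combining separate crawls; the last one is the step that belongs to an ordinary crawl cycle. Each of the first three also accepts -filter, per the "optional filtering" in the command listing quoted above.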
