Hi
On 7/16/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
I've been reviewing the four different merge commands (as of nutch v0.9):
$ nutch | grep merg
  mergedb       merge crawldb-s, with optional filtering
  mergesegs     merge several segments, with optional filtering and slicing
  mergelinkdb   merge linkdb-s, with optional filtering
  merge         merge several segment indexes
Here are the javadocs:
mergedb --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
mergesegs --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
mergelinkdb --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
merge --
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html
Naively: why are there four merge commands? Are some subsets of the others?
Are they used in conjunction? What are the usage scenarios of each?
Each is used in a different scenario:
mergedb: the name does not make it obvious, but it merges crawldb-s. So
think of it as mergecrawldb.
mergesegs: merges segments. It merges <segment>/{content,crawl_fetch,
crawl_generate, crawl_parse, parse_data, parse_text} information from
different segments.
merge: merges Lucene indexes. After an index job, you end up with an
indexes directory with a bunch of part-<num> directories inside.
The merge command takes such a directory and produces a single index. A
single index should perform better (I think). You could say that
merge is poorly named; it should have been called mergeindexes or
something.
mergelinkdb: Should be obvious, merges linkdb-s.
So none of them is a subset of another. They all have different
purposes. It is kind of confusing to have a "merge" command that only
merges indexes, so perhaps we can add a mergeindexes command, keep
merge for some time (noting that it has been deprecated) then remove
it.
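
For what it's worth, here is a rough sketch of how the four commands line up on one crawl. It is a dry run: NUTCH is set to "echo bin/nutch", so nothing is actually executed and each line only prints the command it would run. All paths and segment names are made up, and the argument order (output first, then inputs) is how I remember the usage messages, so please check it by running each command with no arguments before relying on it.

```shell
#!/bin/sh
# Dry-run sketch: prints the merge invocations instead of running them.
# Assumption: every path and segment name below is hypothetical.
# Assumption: argument order is "output first, then inputs"; verify with
# the usage message each command prints when run without arguments.
NUTCH="echo bin/nutch"   # change to "bin/nutch" to run for real (sh/bash word splitting)

merge_plan() {
  # mergedb: merge two crawldbs into one
  $NUTCH mergedb crawl/crawldb_merged crawl1/crawldb crawl2/crawldb
  # mergesegs: merge individual segments into a new segments directory
  $NUTCH mergesegs crawl/segments_merged crawl1/segments/20070701000000 crawl2/segments/20070702000000
  # mergelinkdb: merge two linkdbs into one
  $NUTCH mergelinkdb crawl/linkdb_merged crawl1/linkdb crawl2/linkdb
  # merge: collapse the part-<num> indexes under crawl/indexes into one index
  $NUTCH merge crawl/index crawl/indexes
}

merge_plan
```

Each command works on a different data structure, so none of them can stand in for another.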
I notice that Andrzej wrote the first three, and they have wiki entries (pretty
much the same as the javadoc):
(I found these from http://www.mail-archive.com/[EMAIL PROTECTED]/msg03588.html)
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
It seems most of the nutch-user discussions I've seen so far relate to the simple merge
command. Are the first three "advanced commands"?
--
Doğacan Güney