Re: four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

Dennis Kubes Wed, 18 Jul 2007 18:57:41 -0700

If I am reading the message right :) then yes, that problem would have been fixed by now. I believe that problem was with an earlier version of Nutch (0.7).


Dennis Kubes


Kai_testing Middleton wrote:

Am I correct that the 'new' mergedb and mergelinkdb commands together would fix 
this problem from April 2006
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04112.html


Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Andrzej Bialecki
Tue, 04 Apr 2006 09:29:51 -0700

Olive g wrote:
Hi Andrzej & other gurus who might be reading this message :-):

I ran some tests and somehow my query returned 0 hits against merged indexes. Here is my test case. It's a bit long, so thank you in advance for your patience:

1. crawled the first 100 urls

~/nutch/search/bin/nutch crawl urls-001-100 -dir test1 -depth 1 >&test1.log&

2. set searcher.dir to test1

3. query for "movie"
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie

it returned 64 hits (a web search through Tomcat returned the same result)

4. crawled the second 100 urls

~/nutch/search/bin/nutch crawl urls-101-200 -dir test2 -depth 1 >&test2.log&

5. set searcher.dir to test2

6. query for "movie"
 ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie

it returned 55 hits (a web search through Tomcat returned the same result)

7.  attempted to merge using the following command:
 ../search/bin/nutch merge test3 test1 test2 >& merge-test3&
 returned error:

Exception in thread "main" java.rmi.RemoteException:java.io.IOException: Cannot open filename /user/root/test1/crawldb/segments
       at org.apache.hadoop.dfs.NameNode.open(NameNode.java:120)

8.  attempted to merge again using the following command:

../search/bin/nutch merge test4 test1/indexes test2/indexes >&merge-test4&

  merged successfully with no errors

9. set searcher.dir to test4

10.  query for "movie" by:
  ~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie

and it returned 0 hits (a web search through Tomcat returned the same result)

 060403 201545 10 opening segments in test4/segments
 060403 201545 10 found resource common-terms.utf8 at
 file:/root/nutch/search/conf/common-terms.utf8
 060403 201545 10 opening linkdb in test4/linkdb
 Total hits: 0

It appeared to be looking for test4/segments and test4/linkdb, which did not exist?

Well, the short answer is that you cannot at the moment merge crawldbs or linkdbs. As a consequence, you cannot use multiple outputs of 'nutch crawl' together (because NutchBean needs to reference a single linkdb during searching).

This is technically possible, but simply not implemented (yet).

--
Best regards,
Andrzej Bialecki

----- Original Message ----
From: Doğacan Güney <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, July 16, 2007 1:59:39 PM
Subject: Re: four nutch merge commands: mergedb, mergesegs, mergelinkdb, merge

Hi

On 7/16/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:

I've been reviewing the four different merge commands (as of nutch v0.9):

$ nutch | grep merg
  mergedb           merge crawldb-s, with optional filtering
  mergesegs         merge several segments, with optional filtering and slicing
  mergelinkdb       merge linkdb-s, with optional filtering
  merge             merge several segment indexes

Here are the javadocs:
mergedb -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/CrawlDbMerger.html
mergesegs -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/segment/SegmentMerger.html
mergelinkdb -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/LinkDbMerger.html
merge -- 
http://lucene.apache.org/nutch/apidocs/org/apache/nutch/indexer/IndexMerger.html

Naively: why are there four merge commands? Are some subsets of the others?  
Are they used in conjunction? What are the usage scenarios of each?


Each is used in a different scenario:

mergedb: despite what its name suggests, it is used to merge crawldbs. So
think of it as mergecrawldb.

mergesegs: merges segments. It merges <segment>/{content,crawl_fetch,
crawl_generate, crawl_parse, parse_data, parse_text} information from
different segments.

merge: merges Lucene indexes. After an index job, you end up with an
indexes directory with a bunch of part-<num> directories inside.
The merge command takes such a directory and produces a single index. A
single index has better performance (I think). You could say that
merge is poorly named; it should have been called mergeindexes or
something.

mergelinkdb: Should be obvious, merges linkdb-s.

So none of them is a subset of another. They all have different
purposes. It is kind of confusing to have a "merge" command that only
merges indexes, so perhaps we can add a mergeindexes command, keep
merge for some time (noting that it has been deprecated) then remove
it.
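
As a rough illustration of how the four commands might fit together, here is a sketch for two crawl outputs such as the test1 and test2 directories from the April 2006 test case. It only prints the command lines and does not run Nutch. The argument order (output path first, then inputs) is an assumption based on the 0.9 usage messages, so run each command with no arguments to confirm it. The function name print_merge_plan and the directory names merged, test1 and test2 are made up for the example.

```shell
#!/bin/sh
# Sketch only: prints a possible merge plan, does not execute anything.
# NUTCH is a placeholder for the path to the nutch launcher script.
NUTCH=${NUTCH:-bin/nutch}

# print_merge_plan OUTPUT_DIR CRAWL_DIR...   (hypothetical helper)
# Assumes each crawl dir has the layout written by 'nutch crawl':
# crawldb, linkdb, segments, indexes.
print_merge_plan() {
  out=$1
  shift
  crawldbs=""; linkdbs=""; segments=""; indexes=""
  for crawl in "$@"; do
    crawldbs="$crawldbs $crawl/crawldb"
    linkdbs="$linkdbs $crawl/linkdb"
    # Literal glob: it would only expand on a local filesystem, not on DFS.
    segments="$segments $crawl/segments/*"
    indexes="$indexes $crawl/indexes"
  done
  echo "$NUTCH mergedb $out/crawldb$crawldbs"
  echo "$NUTCH mergelinkdb $out/linkdb$linkdbs"
  echo "$NUTCH mergesegs $out/segments$segments"
  # The output name 'index' is an assumption about what the searcher expects.
  echo "$NUTCH merge $out/index$indexes"
}

print_merge_plan merged test1 test2
```

Whether a search against the merged output returns hits also depends on the segments and linkdb being present under searcher.dir, as the log in the quoted test case suggests.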

I notice that Andrzej wrote the first three, and they have wiki entries (pretty 
much the same as the javadoc):
(I found these from http://www.mail-archive.com/[EMAIL PROTECTED]/msg03588.html)
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergedb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergelinkdb
http://wiki.apache.org/nutch/nutch-0.8-dev/bin/nutch_mergesegs
It seems most of the nutch-user discussions I've seen so far relate to the simple merge 
command.  Are the first three "advanced commands"?
