Thank you! Zaheed sent out the following workaround in another thread. Do you think this would
work on Nutch 0.8 with DFS?

Also, when do you expect to port the feature to 0.8? I know it's not the highest priority for you :), but merging indexes is critical for incremental crawls. Could it be
implemented sooner? Please ... Our project depends on this ...

Thanks again for your help!

Olive

----------------------------------------------------------------------------------------
From :  Zaheed Haque <[EMAIL PROTECTED]>
Reply-To :  [email protected]
Sent :  Tuesday, April 4, 2006 4:12 PM
To :  [email protected]
Subject :  Re: Merging indexes -- please help....

You might want to try this, but I am not sure if it works :-) Please
make backups first!! This is a workaround.

I assume that you have two working indexes, i.e. "CrawlA" and "CrawlB"
(ready to go and working like a charm via the browser :-). I am also
taking for granted that all the directories (index, indexes, segments,
etc.) are inside "CrawlA" and "CrawlB".

Now make a new directory called "CrawlC"

mkdir CrawlC
cd CrawlC
mkdir crawldb
cd crawldb
mkdir current
cd current

Now, from the directory that contains CrawlA, CrawlB and CrawlC, copy the crawldb parts:

cp -r CrawlA/crawldb/current/part-00000 CrawlC/crawldb/current/part-00000
cp -r CrawlB/crawldb/current/part-00000 CrawlC/crawldb/current/part-00001

NOTE the part-00001

Now make a directory "segments" under CrawlC:

mkdir CrawlC/segments

Now copy the segments from both crawls into it:

cp -r CrawlA/segments/* CrawlC/segments/
cp -r CrawlB/segments/* CrawlC/segments/

etc.

Now you should have two directories under CrawlC:

crawldb
segments

Proceed with

- bin/nutch invertlinks linkdb segments/*
- bin/nutch index indexes crawldb linkdb segments/*
- bin/nutch dedup indexes
- bin/nutch merge index indexes

Change your searcher.dir in nutch-site.xml to point at CrawlC and give it a go.
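
The directory setup part of this workaround (everything before the bin/nutch commands) can be sketched as a small shell function. This is only a sketch under assumptions: it works on a local filesystem rather than DFS, it assumes each crawldb has a single part-00000, and it assumes segment names from the two crawls do not collide. The function name merge_crawl_dirs is hypothetical and not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): build a combined crawl directory
# from two existing ones on the local filesystem.
# Usage: merge_crawl_dirs CrawlA CrawlB CrawlC
merge_crawl_dirs() {
    a=$1
    b=$2
    c=$3

    # Create the target layout: crawldb/current and segments.
    mkdir -p "$c/crawldb/current" "$c/segments" || return 1

    # Copy the crawldb part of each crawl; the second one is renamed to
    # part-00001 so that it does not overwrite the first.
    cp -r "$a/crawldb/current/part-00000" "$c/crawldb/current/part-00000" || return 1
    cp -r "$b/crawldb/current/part-00000" "$c/crawldb/current/part-00001" || return 1

    # Copy all segments from both crawls (assumes no two segments share a name).
    cp -r "$a"/segments/* "$c/segments/" || return 1
    cp -r "$b"/segments/* "$c/segments/" || return 1
}
```

After calling it as merge_crawl_dirs CrawlA CrawlB CrawlC, the invertlinks, index, dedup and merge commands listed above would still need to be run against CrawlC. Make backups first, as noted.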



From: Andrzej Bialecki <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Date: Tue, 04 Apr 2006 18:29:07 +0200

Olive g wrote:
Hi Andrzej & other gurus who might be reading this message :-):

I ran some tests and somehow my query returned 0 hits against merged indexes. Here is my test case. It's a bit long, so thank you in advance for your patience:

1. crawled the first 100 urls

~/nutch/search/bin/nutch crawl urls-001-100 -dir test1 -depth 1 >& test1.log&

2. set searcher.dir to test1

3. query for "movie"
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie

it returned 64 hits (a web search through Tomcat returned the same result)

4. crawled the second 100 urls

~/nutch/search/bin/nutch crawl urls-101-200 -dir test2 -depth 1 >& test2.log&

5. set searcher.dir to test2

6. query for "movie"
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
it returned 55 hits (a web search through Tomcat returned the same result)

7. attempted to merge using the following command:
../search/bin/nutch merge test3 test1 test2 >& merge-test3&
which returned this error:
Exception in thread "main" java.rmi.RemoteException: java.io.IOException: Cannot open filename /user/root/test1/crawldb/segments
       at org.apache.hadoop.dfs.NameNode.open(NameNode.java:120)

8. attempted to merge again using the following command:
../search/bin/nutch merge test4 test1/indexes test2/indexes >& merge-test4&
which merged successfully with no errors

9. set searcher.dir to test4

10. query for "movie":
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
and it returned 0 hits (a web search through Tomcat returned the same result)

 060403 201545 10 opening segments in test4/segments
 060403 201545 10 found resource common-terms.utf8 at
 file:/root/nutch/search/conf/common-terms.utf8
 060403 201545 10 opening linkdb in test4/linkdb
 Total hits: 0

It appears to be looking for test4/segments and test4/linkdb, which do not exist?

Well, the short answer is that you cannot at the moment merge crawldbs or linkdbs. As a consequence, you cannot use multiple outputs of 'nutch crawl' together (because NutchBean needs to reference a single linkdb during searching).

This is technically possible, but simply not implemented (yet).
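
The log in the test case shows NutchBean opening test4/segments and test4/linkdb under searcher.dir. A quick way to see which of these are missing from a candidate directory might look like the sketch below. It assumes a local filesystem (not DFS) and checks only the two subdirectories named in that log; the function name missing_searcher_subdirs is hypothetical and not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print each subdirectory that the
# log above shows NutchBean opening but that is absent from the given directory.
# Usage: missing_searcher_subdirs test4
missing_searcher_subdirs() {
    dir=$1
    for sub in segments linkdb; do
        # Print the name only when the subdirectory does not exist.
        [ -d "$dir/$sub" ] || echo "$sub"
    done
}
```

For a directory like test4 in the test case, which was produced by merging indexes only, this would be expected to list both segments and linkdb.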

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
