Thank you! Zaheed sent out the following workaround in another thread. Do you
think it would work on Nutch 0.8 with DFS?
Also, when do you expect to port the feature to 0.8? I know it's not the
highest priority for you :), but merging indexes is critical for incremental
crawls. Is it possible to implement it sooner? Please ... our project depends
on this ...
Thanks again for your help!
Olive
----------------------------------------------------------------------------------------
From : Zaheed Haque <[EMAIL PROTECTED]>
Reply-To : [email protected]
Sent : Tuesday, April 4, 2006 4:12 PM
To : [email protected]
Subject : Re: Merging indexes -- please help....
You might want to try this, but I am not sure if it works :-) Please
make backups first!! This is a workaround..
I assume that you have two working crawls, i.e. "CrawlA" and "CrawlB"
(ready to go and working like a charm via the browser :-). I am also
taking for granted that all directories like index, indexes, segments
etc. are in the directories "CrawlA" and "CrawlB".
Now make a new directory called "CrawlC" next to CrawlA and CrawlB, with an
empty crawldb inside it:
mkdir -p CrawlC/crawldb/current
Now copy the crawldb parts:
cp -r CrawlA/crawldb/current/part-00000 CrawlC/crawldb/current/part-00000
cp -r CrawlB/crawldb/current/part-00000 CrawlC/crawldb/current/part-00001
NOTE the part-00001
Now make a directory segments under CrawlC:
mkdir CrawlC/segments
Now copy the segments:
cp -r CrawlA/segments/* CrawlC/segments/
cp -r CrawlB/segments/* CrawlC/segments/
etc..
Now you should have two directories under CrawlC:
crawldb
segments
Proceed with
- bin/nutch invertlinks linkdb segments/*
- bin/nutch index indexes crawldb linkdb segments/*
- bin/nutch dedup indexes
- bin/nutch merge index indexes
Change your searcher.dir in nutch-site.xml and give it a go..
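The copy steps above can be sketched as a small shell function. This is only
a sketch under assumptions: stage_crawl is a made-up helper name (not part of
Nutch), it works on the local filesystem only (not DFS), and it renumbers the
crawldb parts sequentially in case a source crawl has more than one part. It
has not been verified against a real Nutch 0.8 crawl.

```shell
#!/bin/sh
# stage_crawl DEST SRC...  (hypothetical helper, not part of Nutch)
# Copies crawldb parts and segments from each SRC crawl directory into DEST,
# renumbering the parts as part-00000, part-00001, ...
stage_crawl() {
    dest=$1; shift
    mkdir -p "$dest/crawldb/current" "$dest/segments" || return 1
    n=0
    for src in "$@"; do
        for part in "$src"/crawldb/current/part-*; do
            [ -e "$part" ] || continue   # skip if the glob matched nothing
            cp -r "$part" "$dest/crawldb/current/$(printf 'part-%05d' "$n")" || return 1
            n=$((n + 1))
        done
        for seg in "$src"/segments/*; do
            [ -e "$seg" ] || continue
            # Assumption: segment names differ between crawls; segments with
            # the same name would be copied into one directory.
            cp -r "$seg" "$dest/segments/" || return 1
        done
    done
}

# Example (commented out so nothing runs by accident):
# stage_crawl CrawlC CrawlA CrawlB
```

After staging, the invertlinks / index / dedup / merge commands listed above
would be run against the new directory. As with the original steps, make
backups first.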
From: Andrzej Bialecki <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Query on merged indexes returned 0 hit - test case included
(Nutch 0.8)
Date: Tue, 04 Apr 2006 18:29:07 +0200
Olive g wrote:
Hi Andrzej & other gurus who might be reading this message :-):
I ran some tests and somehow my query returned 0 hits against the merged
indexes. Here is my test case. It's a bit long, so thank you in advance
for your patience:
1. crawled the first 100 urls
~/nutch/search/bin/nutch crawl urls-001-100 -dir test1 -depth 1 >& test1.log&
2. set searcher.dir to test1
3. query for "movie"
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
it returned 64 hits (a web search with Tomcat returned the same
result)
4. crawled the second 100 urls
~/nutch/search/bin/nutch crawl urls-101-200 -dir test2 -depth 1 >& test2.log&
5. set searcher.dir to test2
6. query for "movie"
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
it returned 55 hits (a web search with Tomcat returned the same
result)
7. attempted to merge using the following command:
../search/bin/nutch merge test3 test1 test2 >& merge-test3&
returned error:
Exception in thread "main" java.rmi.RemoteException:
java.io.IOException: Cannot open filename /user/root/test1/crawldb/segments
at org.apache.hadoop.dfs.NameNode.open(NameNode.java:120)
8. attempted to merge again using the following command:
../search/bin/nutch merge test4 test1/indexes test2/indexes >& merge-test4&
merged successfully with no errors
9. set searcher.dir to test4
10. query for "movie" by:
~/nutch/search/bin/nutch org.apache.nutch.searcher.NutchBean movie
and it returned 0 hits (a web search with Tomcat returned the same
result)
060403 201545 10 opening segments in test4/segments
060403 201545 10 found resource common-terms.utf8 at file:/root/nutch/search/conf/common-terms.utf8
060403 201545 10 opening linkdb in test4/linkdb
Total hits: 0
It appears to be looking for test4/segments and test4/linkdb, which do
not exist?
Well, the short answer is that you cannot at the moment merge crawldbs or
linkdbs. As a consequence, you cannot use multiple outputs of 'nutch crawl'
together (because NutchBean needs to reference a single linkdb during
searching).
This is technically possible, but simply not implemented (yet).
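To see why the test4 search above came back empty, a quick check of what is
actually inside a searcher.dir can help. The sketch below is only an
illustration: check_searcher_dir is a made-up helper name, it looks at the
local filesystem only (not DFS), and the list of subdirectories is an
assumption taken from this thread (the log shows segments and linkdb being
opened; index / indexes are the directory names mentioned in the workaround).

```shell
#!/bin/sh
# check_searcher_dir DIR  (hypothetical helper, not part of Nutch)
# Reports which of the subdirectories mentioned in this thread are missing.
check_searcher_dir() {
    dir=$1
    missing=""
    # The log above shows the searcher opening DIR/segments and DIR/linkdb.
    for sub in segments linkdb; do
        [ -d "$dir/$sub" ] || missing="$missing $sub"
    done
    # Assumption: a merged "index" or a per-part "indexes" directory is needed.
    [ -d "$dir/index" ] || [ -d "$dir/indexes" ] || missing="$missing index|indexes"
    if [ -n "$missing" ]; then
        echo "missing in $dir:$missing"
        return 1
    fi
    echo "ok: $dir"
}

# Example (commented out): check_searcher_dir test4
```

For a directory that contains none of these subdirectories, the function
lists all of them as missing and returns a non-zero status.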
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general