Hi gurus,

I tried the workaround and found some more issues. It appears to me that invertlinks does not work properly with more than 5 input parts.

For example, the following command (with the number of map tasks set to 5 and the number of reduce tasks set to 5, using DFS, Nutch 0.8):

../search/bin/nutch invertlinks test5/linkdb test5/segments/20060403192429 test5/segments/20060403193814 >& linkdb-test5 &

generated basically the same error for all 5 reduce tasks:

java.rmi.RemoteException: java.io.IOException: Could not complete write to file /user/root/test5/linkdb/362527374/part-00000/.data.crc by DFSClient_441718647
        at java.lang.Throwable.<init>(Throwable.java:57)
        at java.lang.Throwable.<init>(Throwable.java:68)
        at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:205)


the contents of test5/segments/20060403192429/content/ are

/user/root/test5/segments/20060403192429/content/part-00000     123617
/user/root/test5/segments/20060403192429/content/part-00001     141105
/user/root/test5/segments/20060403192429/content/part-00002     168565
/user/root/test5/segments/20060403192429/content/part-00003     179788
/user/root/test5/segments/20060403192429/content/part-00004     70356

the contents of test5/segments/20060403193814/content/ are

/user/root/test5/segments/20060403193814/content/part-00000     103014
/user/root/test5/segments/20060403193814/content/part-00001     159010
/user/root/test5/segments/20060403193814/content/part-00002     92892
/user/root/test5/segments/20060403193814/content/part-00003     103847
/user/root/test5/segments/20060403193814/content/part-00004     102626

In the example above there are 10 input parts across the two segments. I noticed that this doesn't happen when there are no more than 5 input parts, and it consistently happens when there are more than 5, even if they are all in the same segment.
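For anyone trying to reproduce this, here is a small helper (not part of Nutch, just an illustration) to count how many part-* inputs invertlinks will see across a set of segments, so you can tell whether the 5-part threshold described above is crossed. It assumes local copies of the segment directories; with DFS you would list them through the dfs shell instead.

```shell
# Hypothetical helper: count part-* dirs under <segment>/content/ for
# each segment passed in, and print the total number of input parts.
count_parts() {
  total=0
  for seg in "$@"; do
    for p in "$seg"/content/part-*; do
      # Skip the unexpanded glob when a segment has no parts.
      [ -e "$p" ] && total=$((total + 1))
    done
  done
  echo "$total"
}
```

With the two segments above, count_parts would report 10, which is where the failure starts showing up.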

This problem is urgent because it prevents incremental crawling, whether by merging segments or by incremental depth crawling: after 5 more incremental crawls we have 6 parts.

Please let me know what you think.

Thank you!

Olive



From: Andrzej Bialecki <[EMAIL PROTECTED]>
Reply-To: [email protected]
To: [email protected]
Subject: Re: Query on merged indexes returned 0 hit - test case included (Nutch 0.8)
Date: Tue, 04 Apr 2006 19:20:43 +0200

Olive g wrote:
Thank you! Zaheed sent out a workaround in another thread, as follows. Do you think this would work (on Nutch 0.8 with DFS)?


Yes, it should work. This is a cheap way to merge two DBs - thanks Zaheed! Just remember to rename the part-xxxxx dirs so that they are sequential.
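A sketch of that renaming step, for anyone following along: once the part-xxxxx dirs from both DBs sit in one merged directory, renumber them so they run sequentially from part-00000. This assumes the merged directory is on the local filesystem; on DFS the moves would go through the dfs shell instead.

```shell
# Renumber part-* dirs in $1 so they are sequential: part-00000, part-00001, ...
renumber_parts() {
  dir=$1
  i=0
  for p in "$dir"/part-*; do
    target=$(printf '%s/part-%05d' "$dir" "$i")
    # Skip the mv when a dir already has the right name.
    [ "$p" = "$target" ] || mv "$p" "$target"
    i=$((i + 1))
  done
}
```

Because the glob is sorted and the new index never exceeds the old one, no rename can clobber a directory that has not been processed yet.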

Also, when do you expect to port the feature to 0.8? (I know it's not the highest priority for you :)) But really, merging indexes is critical for incremental crawls. Is it possible that it can be implemented sooner? Please ... Our project depends on this ...


These features (incremental updates, merging indexes) are already supported if you use individual command-line tools and a single DB. So, I'm not planning to do anything about it.
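For reference, the single-DB incremental cycle being referred to here looks roughly like this with the 0.8 command-line tools. The directory layout and names below are assumptions for illustration, not taken from this thread:

```shell
# One incremental cycle against a single crawldb/linkdb (paths assumed).
bin/nutch generate crawl/crawldb crawl/segments
segment=crawl/segments/$(ls crawl/segments | tail -1)   # newest segment
bin/nutch fetch "$segment"
bin/nutch updatedb crawl/crawldb "$segment"
bin/nutch invertlinks crawl/linkdb crawl/segments/*
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb "$segment"
```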

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
