Re: [Nutch-general] nutch 0.8 (+ hadoop 0.5) does not crawl reliably

Andrzej Bialecki Tue, 24 Oct 2006 18:27:19 -0700

Teruhiko Kurosaka wrote:
> I am using nutch 0.8 (with hadoop 0.5, to get around
> the Java exception that I asked about a few months ago)
> with a custom analyzer plugin and some modifications to
> NutchAnalysis.jj.
>
> I ran "nutch crawl" over the same test site of just three HTML 
> files after clearing the index directory.  In two out of three tries,
> the crawl session fetched only the index page.  Only one run
> (out of three) successfully fetched all pages.  All the
> crawl runs were done using exactly the same parameters.
>
> Has anybody experienced strange behavior like this?
>


There was a bug in some versions of 0.8: if you ran it with the
"local" FS and jobtracker, it would generate too many parts of the
fetchlist and then process only one randomly selected part. If that's
the case, and you are indeed running in "local" mode, try setting the
number of map and reduce tasks in your hadoop-site.xml to 1.
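
A minimal sketch of such an override, assuming the usual Hadoop property
names mapred.map.tasks and mapred.reduce.tasks (these names are not given
above, so check them against hadoop-default.xml in your version) and
assuming the file lives in your conf/ directory:

```xml
<?xml version="1.0"?>
<!-- conf/hadoop-site.xml: site-specific overrides of hadoop-default.xml -->
<configuration>

  <!-- Assumed property name: number of map tasks per job.
       With 1, only a single fetchlist part should be generated. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>1</value>
  </property>

  <!-- Assumed property name: number of reduce tasks per job. -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>

</configuration>
```

After changing this, clear the crawl and index directories and rerun
"nutch crawl" a few times to see whether all three pages are fetched
on every run.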

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



