Re: Threads blocked by blockAddr()

Andrzej Bialecki Sat, 07 Feb 2009 01:53:41 -0800

[email protected] wrote:

Hi,
I'm trying to crawl about 2 million pages from the web by using the intranet method, i.e. bin/nutch crawl dmoz/urls -dir crawl -threads 20 -depth 5 -topN 2000000. When it reaches about the 3rd level, most of the threads are blocked by blockAddr(). I see that blockAddr() blocks threads when the number of threads accessing the host exceeds fetcher.threads.per.host, which is set to 2 at the moment. By the 3rd level, about 100k pages would have been crawled. My question is: why do the threads access just a few hosts while there are (potentially) many other hosts for them to try? Is there a way to work around this?

Use Fetcher2 instead of Fetcher - it uses a different queue and blocking algorithm, which works better in such cases.
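
To illustrate why the queueing scheme matters, here is a minimal sketch of one way a queue-per-host approach can avoid this kind of blocking. It is not Nutch's Fetcher or Fetcher2 code; the class HostQueues and its methods are hypothetical names made up for this example, and maxPerHost merely plays the role of fetcher.threads.per.host. Instead of a thread taking the next URL and then waiting until its host is free, the thread asks for a URL from any host that is currently below its limit.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; this is NOT Nutch's Fetcher/Fetcher2 code.
final class HostQueues {
    // Assumption: plays the role of fetcher.threads.per.host.
    private final int maxPerHost;
    // One FIFO queue of pending URLs per host, in order of first appearance.
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    // Number of URLs currently being fetched from each host.
    private final Map<String, Integer> inFlight = new HashMap<>();

    HostQueues(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    // Queue a URL under its host name.
    synchronized void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    // Return a URL from the first host that is below its limit,
    // or null if every host with pending URLs is at its limit.
    synchronized String next() {
        for (Map.Entry<String, Deque<String>> e : queues.entrySet()) {
            Deque<String> pending = e.getValue();
            if (pending.isEmpty()) {
                continue;
            }
            if (inFlight.getOrDefault(e.getKey(), 0) >= maxPerHost) {
                continue; // host is busy: skip it instead of blocking on it
            }
            inFlight.merge(e.getKey(), 1, Integer::sum);
            return pending.poll();
        }
        return null;
    }

    // Mark a fetch as finished so the host can be used again.
    synchronized void done(String url) {
        inFlight.merge(URI.create(url).getHost(), -1, Integer::sum);
    }

    public static void main(String[] args) {
        HostQueues q = new HostQueues(2);
        q.add("http://a.example/1");
        q.add("http://a.example/2");
        q.add("http://a.example/3");
        q.add("http://b.example/1");
        // The third request skips the busy host a.example and serves b.example.
        for (int i = 0; i < 3; i++) {
            System.out.println(q.next());
        }
    }
}
```

With a single shared list, a run of URLs from the same host makes every thread that picks one of them wait for that host. Here a busy host is skipped, so threads only idle when all hosts with pending URLs are at their limit.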


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
