[email protected] wrote:
Hi,

I'm trying to crawl about 2 million pages from the web using the Intranet method, i.e. bin/nutch crawl dmoz/urls -dir crawl -threads 20 -depth 5 -topN 2000000. When it reaches about the 3rd level, most of the threads are blocked in blockAddr(). I see that blockAddr() blocks a thread when the number of threads accessing a host exceeds fetcher.threads.per.host, which is set to 2 at the moment. By the 3rd level, about 100k pages would have been crawled. My question is: why do the threads access just a few hosts when there are (potentially) many other hosts for them to try? Is there a way to work around this?

Use Fetcher2 instead of Fetcher - it uses a different queue and blocking algorithm, which works better in such cases.
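
To illustrate the general idea only: the sketch below is a hypothetical per-host queue, not Nutch's actual Fetcher2 code. The class name HostQueues and its methods are made up for this example, and maxPerHost merely plays the role of fetcher.threads.per.host. The point is that a thread asking for work skips any host that is already at its limit and takes a URL from another host, instead of blocking on the busy one.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch of per-host fetch queues; NOT the Fetcher2 implementation. */
class HostQueues {
    // Assumed stand-in for the fetcher.threads.per.host setting.
    private final int maxPerHost;
    // Pending URLs grouped by host, in the order hosts were first seen.
    private final Map<String, Queue<String>> pending = new LinkedHashMap<>();
    // Number of fetches currently running against each host.
    private final Map<String, Integer> inFlight = new HashMap<>();

    HostQueues(int maxPerHost) {
        this.maxPerHost = maxPerHost;
    }

    /** Queues a URL under its host. */
    synchronized void add(String host, String url) {
        pending.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /**
     * Returns a URL from the first host that is below its limit and has work,
     * or null if every host is either busy or empty. The caller never blocks
     * on a single saturated host.
     */
    synchronized String next() {
        for (Map.Entry<String, Queue<String>> e : pending.entrySet()) {
            String host = e.getKey();
            Queue<String> urls = e.getValue();
            if (inFlight.getOrDefault(host, 0) < maxPerHost && !urls.isEmpty()) {
                inFlight.merge(host, 1, Integer::sum);
                return urls.remove();
            }
        }
        return null;
    }

    /** Marks one fetch against the host as finished, freeing a slot. */
    synchronized void done(String host) {
        inFlight.merge(host, -1, Integer::sum);
    }
}
```

With a limit of 2, two calls to next() can hand out URLs from the same host; the third call moves on to a different host rather than waiting for the first one to free up.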

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
