Threads blocked by blockAddr()

dayzman Fri, 06 Feb 2009 17:04:20 -0800

Hi,

I'm trying to crawl about 2 million pages from the web using the intranet method, i.e. bin/nutch crawl dmoz/urls -dir crawl -threads 20 -depth 5 -topN 2000000. When it reaches about the 3rd level, most of the threads are blocked by blockAddr(). I see that blockAddr() blocks a thread when the number of threads accessing a host exceeds fetcher.threads.per.host, which is set to 2 at the moment. By the 3rd level, about 100k pages would have been crawled. My question is: why do the threads access just a few hosts while there are (potentially) many other hosts for them to try? Is there a way to work around this?
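
To make the behavior concrete, here is a minimal Python sketch of the per-host limit as I understand it. It is not Nutch's actual code: the function name schedule() and the example URLs are made up for illustration, and it assumes that blocking is decided purely by host name and that each thread simply takes the next URL in the fetch list.

```python
from collections import Counter
from urllib.parse import urlparse


def schedule(urls, threads, threads_per_host):
    """Hand the first `threads` URLs to threads, one URL per thread.

    Returns (running, blocked): URLs whose thread may fetch right away,
    and URLs whose thread has to wait because the host is already at
    its per-host limit. Hypothetical helper, not part of Nutch.
    """
    active = Counter()          # host -> number of threads currently on it
    running, blocked = [], []
    for url in urls[:threads]:
        host = urlparse(url).hostname
        if active[host] < threads_per_host:
            active[host] += 1
            running.append(url)
        else:
            blocked.append(url)  # this is where a thread would sit and wait
    return running, blocked


# A batch dominated by one host (made-up URLs): 16 of the 20 are on big.example.
skewed = ["http://big.example/page%d" % i for i in range(16)]
skewed += ["http://site%d.example/" % i for i in range(4)]

running, blocked = schedule(skewed, threads=20, threads_per_host=2)
print(len(running), "running,", len(blocked), "blocked")
```

With 20 threads and a limit of 2, only 6 threads in this batch can run (2 on big.example plus 4 on the other hosts) and 14 are blocked. That would match what I see if the fetch list for the 3rd level is dominated by a few hosts, but that is only my guess.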


Thanks!

Michael
