Hi,
I'm trying to crawl about 2 million pages from the web using the intranet
crawl method, i.e. bin/nutch crawl dmoz/urls -dir crawl -threads 20 -depth 5
-topN 2000000. When the crawl reaches about the 3rd level, most of the
threads are blocked in blockAddr(). As far as I can see, blockAddr() blocks
a thread when the number of threads accessing the same host exceeds
fetcher.threads.per.host, which is set to 2 at the moment. By the 3rd level,
about 100k pages would have been crawled. My question is: why do the threads
access just a few hosts when there are (potentially) many other hosts for
them to try? Is there a way to work around this?
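
To make clear what I think is happening, here is a minimal sketch of the
per-host limit as I understand it. This is not the actual Nutch code: the
class name, the method name and the host names are made up for illustration,
and the real blockAddr() makes threads wait rather than just counting them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not taken from the Nutch source.
public class PerHostLimitSketch {

    // Given the host of the URL each thread wants to fetch next, count how
    // many threads could run at the same time if at most maxPerHost threads
    // may access any single host concurrently.
    static int countAdmitted(List<String> hosts, int maxPerHost) {
        Map<String, Integer> active = new HashMap<>();
        int admitted = 0;
        for (String host : hosts) {
            int n = active.getOrDefault(host, 0);
            if (n < maxPerHost) {
                active.put(host, n + 1);
                admitted++;
            }
            // otherwise this thread would have to wait for the host to free up
        }
        return admitted;
    }

    public static void main(String[] args) {
        // 20 threads, but their next URLs come from only 3 (made-up) hosts.
        List<String> next = new ArrayList<>();
        for (int i = 0; i < 10; i++) next.add("a.example.com");
        for (int i = 0; i < 6; i++) next.add("b.example.com");
        for (int i = 0; i < 4; i++) next.add("c.example.com");

        int running = countAdmitted(next, 2);
        System.out.println(running + " of " + next.size() + " threads can fetch");
        // prints "6 of 20 threads can fetch"
    }
}
```

If the URLs handed to the threads are clustered on a few hosts like this,
most of the 20 threads would end up waiting, which looks like what I'm seeing.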
Thanks!
Michael