Ken Krugler wrote:
Hi Doug,
Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
running into a similar problem.
We wound up dramatically increasing the number of threads, which
seemed to help solve the bandwidth utilization problem. With Nutch 0.7
we were running about 200 threads per crawler, and with Nutch 0.8 it's
more like 2000+ threads...though you have to reduce the thread stack
size in this type of configuration.
The fetchlist seems to be sorted by URL. This leads to many threads being
blocked when the crawler is configured to fetch with a low number of threads
per host (default 1) and there are several URLs from the same host in the
fetchlist.
This could perhaps be improved by sorting by some other key?
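
As a rough illustration of the idea (not Nutch code, and not necessarily how
the generator should do it), here is a small Python sketch that compares a
URL-sorted fetchlist with one sorted by a hash of the URL. The names
interleave_key and longest_host_run, the example hosts, and the choice of MD5
are all assumptions made up for this sketch. The longest run of consecutive
URLs from the same host is used as a crude proxy for how many threads would
end up waiting on one host.

```python
import hashlib
from urllib.parse import urlsplit


def interleave_key(url: str) -> str:
    """Hypothetical sort key: hex MD5 digest of the URL.

    Hashing scatters URLs from the same host across the list instead of
    keeping them adjacent, as plain URL ordering does.
    """
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def longest_host_run(urls) -> int:
    """Length of the longest run of consecutive URLs sharing a host."""
    longest = current = 0
    previous = None
    for url in urls:
        host = urlsplit(url).hostname
        current = current + 1 if host == previous else 1
        longest = max(longest, current)
        previous = host
    return longest


# Made-up fetchlist: 5 hosts with 20 pages each.
fetchlist = [
    f"http://host{h}.example/page{p}" for h in range(5) for p in range(20)
]

by_url = sorted(fetchlist)                       # all URLs of a host adjacent
by_hash = sorted(fetchlist, key=interleave_key)  # hosts interleaved

print("longest same-host run, sorted by URL: ", longest_host_run(by_url))
print("longest same-host run, sorted by hash:", longest_host_run(by_hash))
```

A hash key only makes long same-host runs unlikely; it does not rule them
out. An explicit round-robin over hosts would give a stronger guarantee, at
the cost of grouping the list by host first.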
--
Sami Siren