+1 for a solution to this pressing issue!

I am seeing the same problem, in my case with two symptoms:

1) low fetch speeds
2) crawls ending "before their time" with an "aborting with xxx hung
threads" error message

I am running a focused crawl on about 70,000 domains.
crawl.ignore.external.links is set to true.

In previous discussions on the list these issues have mainly been
attributed to crawls on such a limited set of domains.

Let me see if I understand this correctly: FetchLists are host-wise
disjoint, so all URLs from the same host end up in the same FetchList.
Folks *not* on MapReduce are by definition always working with a
single Fetcher. Otherwise there can be many, in which case this
mechanism keeps the politeness rules from being violated.

Could somebody confirm these assumptions are correct?
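
For illustration, this is roughly how I picture the partitioning (just
a sketch of my understanding, not the actual Nutch code; the class and
method names here are made up):

    import java.net.URL;

    public class HostPartitionSketch {
        // Every URL from the same host hashes to the same partition,
        // so one fetchlist -- and hence one Fetcher -- owns that host.
        public static int partitionFor(String url, int numFetchlists)
                throws Exception {
            String host = new URL(url).getHost();
            return (host.hashCode() & Integer.MAX_VALUE) % numFetchlists;
        }
    }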

I have tried to work around these issues by changing the
configuration: I increased fetcher.threads.fetch, http.timeout and
http.max.delays.

I also changed the generate.max.per.host setting, following Doug's
advice to set it to topN / fetcher threads, all to no lasting avail.

So far I haven't tried raising fetcher.threads.per.host beyond 4 with
100 threads, though. I will do that now.
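
For reference, these are the kinds of entries I have been playing with
in nutch-site.xml (the values below are just what I am experimenting
with at the moment, not recommendations):

    <property>
      <name>fetcher.threads.fetch</name>
      <value>100</value>
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>4</value>
    </property>
    <property>
      <name>generate.max.per.host</name>
      <!-- per Doug's advice: roughly topN / number of fetch threads -->
      <value>1000</value>
    </property>
    <property>
      <name>http.timeout</name>
      <value>30000</value> <!-- milliseconds, raised from the default -->
    </property>
    <property>
      <name>http.max.delays</name>
      <value>100</value> <!-- raised from the default -->
    </property>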

I really think we should gather more data on these fetch-speed
problems. Maybe those of you who are seeing decent fetch speeds in a
focused-crawl setup could share your tips for tuning the installation.

Thanks a lot for your time if you read this far :)

Rgrds, Thomas Delnoij

On 6/28/06, Sami Siren <[EMAIL PROTECTED]> wrote:
Ken Krugler wrote:

> Hi Doug,
>
>> Did you ever resolve your 0.8 vs 0.7 crawling performance question? I'm
>> running into a similar problem.
>
>
> We wound up dramatically increasing the number of threads, which
> seemed to help solve the bandwidth utilization problem. With Nutch 0.7
> we were running about 200 threads per crawler, and with Nutch 0.8 it's
> more like 2000+ threads...though you have to reduce the thread stack
> size in this type of configuration.
>
The fetchlist seems to be sorted by URL. This leads to many threads
being blocked when the crawler is configured to fetch with a low
number of threads per host (default 1) and there are several URLs
from the same host in the fetchlist.

This could perhaps be improved by sorting by some other key?
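
For example (just a sketch of the idea, not actual Nutch code),
sorting by a mixed hash of the URL would scatter runs of same-host
entries across the list:

    import java.util.Arrays;
    import java.util.Comparator;

    public class HashOrderSketch {
        // Bit mixer so nearly identical URLs don't stay adjacent.
        static int mix(int h) {
            h ^= h >>> 16;
            h *= 0x85ebca6b;
            h ^= h >>> 13;
            return h & Integer.MAX_VALUE;
        }

        static final Comparator<String> BY_URL_HASH =
                new Comparator<String>() {
            public int compare(String a, String b) {
                int ha = mix(a.hashCode()), hb = mix(b.hashCode());
                if (ha != hb) return ha < hb ? -1 : 1;
                return a.compareTo(b); // deterministic tie-break
            }
        };

        public static void main(String[] args) {
            String[] urls = {
                "http://a.example/1", "http://a.example/2",
                "http://b.example/1", "http://c.example/1",
            };
            Arrays.sort(urls, BY_URL_HASH);
            for (String u : urls) System.out.println(u);
        }
    }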

--
 Sami Siren
