> Hi Matt,
>
> I'm trying to do a full crawl (all the pages in each site) of about
> 100 sites. Unfortunately I'm getting as many errors as successful
> fetches, almost all of them max.delays.exceeded. Is there any way to
> cut down on these errors? I tried raising the max.delays property in
> the Nutch conf, and I've also tried using fewer threads (going from
> 100 down to 50), but with no real improvement. This is with the
> nutch-0.8-dev version. Any help would be immensely appreciated.

We've had the same experience. If you're doing a limited domain
crawl, then the polite crawler setting of one thread per host will
wind up triggering a lot of these max delays exceeded errors.
We're experimenting with a modified version of Nutch that has the
following changes:
a. Sort topN URLs by host IP address, and create a secondary "domain"
index into this list, with one entry for each group of identical IP
addresses.
b. Sort this "domain" index by # of URLs (max to min), and then have
each thread fetch every URL for a domain before moving on to the next
domain (a rough sketch of (a) and (b) follows this list).
c. Use HTTP 1.1 keep-alive support to optimize sequential fetches of
pages from the same domain.
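
To make (a) and (b) a bit more concrete, here's a minimal stand-alone
sketch in plain Java (the class and method names are just for
illustration, they don't correspond to anything in Nutch) that groups
a list of URLs by resolved host IP and orders the groups from most
URLs to fewest:

import java.net.InetAddress;
import java.net.URL;
import java.util.*;

/** Illustrative sketch only (not Nutch code): group URLs by resolved
 *  host IP, then order the groups so the largest "domains" come first. */
public class DomainIndexSketch {

    public static List<List<String>> groupByHostIp(List<String> urls)
            throws Exception {
        Map<String, List<String>> byIp = new HashMap<String, List<String>>();
        for (String u : urls) {
            String host = new URL(u).getHost();
            // One blocking lookup per URL here; in practice the host->IP map
            // is built up front and in parallel (see the DNS sketch below).
            String ip = InetAddress.getByName(host).getHostAddress();
            List<String> group = byIp.get(ip);
            if (group == null) {
                group = new ArrayList<String>();
                byIp.put(ip, group);
            }
            group.add(u);
        }
        // The secondary "domain" index: one entry per IP address,
        // sorted max -> min by URL count.
        List<List<String>> index = new ArrayList<List<String>>(byIp.values());
        Collections.sort(index, new Comparator<List<String>>() {
            public int compare(List<String> a, List<String> b) {
                return b.size() - a.size();
            }
        });
        return index;
    }
}

With an index like this, each fetcher thread can claim the next group
and work straight through it, which is also what makes the keep-alive
optimization in (c) pay off.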
In order to quickly map a large number of host names to IP addresses,
we had to write some code that fires up a pool of threads. These
resolve (and effectively cache) all of the domain->IP address
mappings in parallel, thus avoiding a big performance hit from DNS
latency.
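
For what it's worth, the shape of that code is roughly this (again,
an illustrative stand-alone class, not the code we're actually
running): a fixed-size thread pool resolves every hostname
concurrently and drops the results into a shared map.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.*;
import java.util.concurrent.*;

/** Illustrative sketch only: resolve many hostnames in parallel so the
 *  per-lookup DNS latencies overlap instead of adding up serially. */
public class ParallelDnsResolver {

    public static Map<String, String> resolveAll(Collection<String> hosts,
            int numThreads) throws InterruptedException {
        final ConcurrentMap<String, String> hostToIp =
            new ConcurrentHashMap<String, String>();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (final String host : hosts) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        // Each worker does one blocking DNS lookup; because many
                        // run at once, the total wall-clock time is set by the
                        // slowest lookups rather than the sum of all of them.
                        hostToIp.put(host,
                            InetAddress.getByName(host).getHostAddress());
                    } catch (UnknownHostException e) {
                        // Unresolvable hosts are simply skipped in this sketch.
                    }
                }
            });
        }
        pool.shutdown();
        // Arbitrary upper bound for the sketch; tune to taste.
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return hostToIp;
    }
}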
Two issues we're dealing with now are:
1. Some sites are trickling back data after our fetcher has decided
that it's downloaded everything it needs. This causes a long delay in
the fetch round, as we wait for the thread to terminate. I don't
think this is specific to doing a domain-limited crawl, but I thought
I'd mention it.
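
By way of illustration only (this is a generic plain-Java sketch, not
how the Nutch protocol plugins are actually structured), the kind of
guard that bounds a trickling fetch is a connect/read timeout plus a
hard cap on total fetch time:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Illustrative sketch only: connect/read timeouts keep a dead server
 *  from hanging the thread, and a total-time cap keeps a trickling
 *  server from dragging out the end of a fetch round. */
public class TimeoutFetchSketch {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(args[0]).openConnection();
        conn.setConnectTimeout(10 * 1000);  // give up if connecting takes > 10s
        conn.setReadTimeout(30 * 1000);     // give up if no bytes arrive for 30s
        long deadline = System.currentTimeMillis() + 5 * 60 * 1000;  // 5 min cap
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[8192];
        while (in.read(buf) != -1) {
            if (System.currentTimeMillis() > deadline) {
                break;  // the server is trickling; abandon the rest of the body
            }
        }
        in.close();
    }
}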
2. The protocol plugin API doesn't currently pass back the information
the fetcher needs to decide how many URLs in a row it can politely
download from the same host, so we're having to do a bit of a hack to
make that work.
-- Ken
PS - One optimization would be to alter URL weights to avoid having
any one domain with a significantly higher percentage of URLs than
any other domain, but so far that hasn't been an issue for us.
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200