Hi Matt,

I'm trying to do a full crawl (all the pages in each site) of about 100 sites. Unfortunately I'm getting as many errors as successful fetches, mostly max.delays.exceeded errors. Is there any way to cut down on these? I tried raising the max.delays property in the Nutch conf, and I've also tried using fewer threads (went from 100 down to 50), but with no real improvement. This is with the nutch-0.8-dev version. Any help would be immensely appreciated.

We've had the same experience. If you're doing a limited-domain crawl, the polite-crawler setting of one thread per host winds up triggering a lot of these max.delays.exceeded errors.

We're experimenting with a modified version of Nutch that has the following changes:

a. Sort the topN URLs by host IP address, and create a secondary "domain" index into this list, with one entry for each group of identical IP addresses (there's a rough sketch of a and b in the code just after this list).

b. Sort this "domain" index by # of URLs (max to min), and then have each thread fetch every URL for a domain before moving on to the next domain.

c. Use HTTP 1.1 keep-alive support to optimize sequential fetches of pages from the same domain (also sketched below).
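
To make (a) and (b) a bit more concrete, here's a rough, untested sketch of how the grouping and sorting could look. It's plain Java, and the class and method names are made up for illustration; they're not from our patch or from Nutch itself.

import java.net.InetAddress;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DomainGrouper {

    // Group URLs by the IP address their host resolves to, then order the
    // groups from most URLs to fewest, so a fetcher thread can drain one
    // group completely before moving on to the next.
    public static List<List<String>> groupByHostIp(List<String> urls) throws Exception {
        Map<String, List<String>> byIp = new HashMap<String, List<String>>();
        for (String u : urls) {
            String host = new URL(u).getHost();
            // Ideally this hits a cache pre-populated by the parallel DNS
            // resolution step described further down in this mail.
            String ip = InetAddress.getByName(host).getHostAddress();
            List<String> group = byIp.get(ip);
            if (group == null) {
                group = new ArrayList<String>();
                byIp.put(ip, group);
            }
            group.add(u);
        }
        // The secondary "domain" index: one entry per distinct IP address,
        // sorted by group size, largest first.
        List<List<String>> groups = new ArrayList<List<String>>(byIp.values());
        Collections.sort(groups, new Comparator<List<String>>() {
            public int compare(List<String> a, List<String> b) {
                return b.size() - a.size();
            }
        });
        return groups;
    }
}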
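
And for (c), a minimal illustration of sequential fetches from one host, using the stock HttpURLConnection rather than Nutch's protocol plugins: as long as each response body is fully read and the stream closed, the JDK keeps the connection alive and reuses it for the next request to that host. Again, the names here are just for illustration.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;

public class KeepAliveFetcher {

    // Fetch a batch of URLs that all live on the same host, reusing the
    // underlying persistent connection between requests where possible.
    public static void fetchAll(List<URL> urlsForOneHost, long politenessDelayMs)
            throws Exception {
        byte[] buf = new byte[8192];
        for (URL url : urlsForOneHost) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            InputStream in = conn.getInputStream();
            try {
                // Consuming the whole body lets the JDK return the socket to
                // its keep-alive cache for the next request to this host.
                while (in.read(buf) != -1) {
                    // discard; a real fetcher would hand the bytes to the parser
                }
            } finally {
                in.close();
            }
            // Still observe the per-host politeness delay between requests.
            Thread.sleep(politenessDelayMs);
        }
    }
}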

To quickly map a large set of host names to IP addresses, we had to write some code that fires up a pool of threads. These resolve (and effectively cache) all of the domain -> IP address mappings in parallel, avoiding a big performance hit from DNS latency.
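
Roughly speaking (this isn't our actual code, and the pool size and timeout are arbitrary), that pre-resolution step looks something like this:

import java.net.InetAddress;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelResolver {

    // Resolve every host name up front, in parallel, so fetcher threads
    // later find the host -> IP mappings already available instead of
    // paying DNS latency one lookup at a time.
    public static Map<String, String> resolveAll(Collection<String> hosts)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(50);
        final Map<String, String> hostToIp =
                Collections.synchronizedMap(new HashMap<String, String>());
        for (final String host : hosts) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        hostToIp.put(host, InetAddress.getByName(host).getHostAddress());
                    } catch (Exception e) {
                        // Hosts that don't resolve are simply left out of the map.
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return hostToIp;
    }
}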

Two issues we're dealing with now are:

1. Some sites trickle data back after our fetcher has decided that it's downloaded everything it needs. This causes a long delay in the fetch round while we wait for the thread to terminate (one possible mitigation is sketched after this list). I don't think this is specific to doing a domain-limited crawl, but I thought I'd mention it.

2. The protocol plugin API doesn't currently give the fetcher the information it needs to know how many URLs in a row it can politely download from a host, so we're having to do a bit of a hack to make that work (a purely hypothetical sketch of the kind of interface we mean follows below).
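
For (1), one mitigation would be to bound how long any single fetch may take. Here's a sketch using the stock URLConnection timeouts; the values are arbitrary, and a server that trickles a byte every few seconds can still slide under a per-read timeout, so a cap on total fetch time may also be needed.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class BoundedFetch {

    // Fetch one page, refusing to wait forever on a slow or stalled server.
    public static byte[] fetch(URL url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(10 * 1000); // give up if connecting takes more than 10s
        conn.setReadTimeout(30 * 1000);    // give up if any single read stalls for more than 30s
        InputStream in = conn.getInputStream();
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            in.close();
        }
    }
}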
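
For (2), the kind of information we mean is roughly the following. To be clear, this interface is purely hypothetical (the names are made up); it is not part of the current protocol plugin API.

// Purely hypothetical sketch, not part of the current protocol plugin API.
// It just illustrates the kind of per-host politeness information a plugin
// could hand back to the fetcher so it can plan a batch of requests.
public interface PolitenessHints {

    /** How many URLs in a row the fetcher may politely download from this host. */
    int getMaxSequentialFetches(String host);

    /** The delay to observe between consecutive requests to this host, in milliseconds. */
    long getCrawlDelay(String host);
}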

-- Ken

PS - One optimization would be to alter URL weights so that no single domain ends up with a significantly higher percentage of the URLs than any other domain, but so far that hasn't been an issue for us (a trivial per-host cap along those lines is sketched below).
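
If it ever does become an issue, even something as crude as capping the share of the fetch list that any one host can take would probably do. A sketch (the cap fraction is arbitrary):

import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DomainCap {

    // Drop URLs beyond a per-host cap so no single domain takes more than
    // maxFraction of the fetch list.
    public static List<String> capPerHost(List<String> urls, double maxFraction)
            throws Exception {
        int cap = Math.max(1, (int) (urls.size() * maxFraction));
        Map<String, Integer> counts = new HashMap<String, Integer>();
        List<String> kept = new ArrayList<String>();
        for (String u : urls) {
            String host = new URL(u).getHost();
            Integer seen = counts.get(host);
            int n = (seen == null) ? 0 : seen.intValue();
            if (n < cap) {
                kept.add(u);
                counts.put(host, Integer.valueOf(n + 1));
            }
        }
        return kept;
    }
}
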
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
