> Hi Matt,
>
> I'm trying to do a full crawl (all the pages in each site) of about
> 100 sites. Unfortunately I'm getting as many errors as successful
> fetches, almost all of them max.delays.exceeded. Is there any way to
> cut down on these errors? I tried raising the max.delays property in
> the Nutch conf, and I've also tried using fewer threads (going from
> 100 down to 50), but with no real improvement. This is with the
> nutch-0.8-dev version. Any help would be immensely appreciated.

We've had the same experience. If you're doing a limited domain
crawl, then the polite crawler setting of one thread per host will
wind up triggering a lot of these max delays exceeded errors.
We're experimenting with a modified version of Nutch that has the
following changes:
a. Sort topN URLs by host IP address, and create a secondary "domain"
index into this list, with one entry for each group of identical IP
addresses.
b. Sort this "domain" index by # of URLs (max to min), and then have
each thread fetch every URL for a domain before moving on to the next
domain (a rough sketch of (a) and (b) follows this list).
c. Use HTTP 1.1 keep-alive support to optimize sequential fetches of
pages from the same domain.
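
To make (a) and (b) a bit more concrete, here's a minimal stand-alone
sketch in plain Java (the class and method names are just for
illustration, they don't correspond to anything in Nutch) that groups
a list of URLs by resolved host IP and orders the groups from most
URLs to fewest:

import java.net.InetAddress;
import java.net.URL;
import java.util.*;

/** Illustrative sketch only (not Nutch code): group URLs by resolved
 *  host IP, then order the groups so the largest "domains" come first. */
public class DomainIndexSketch {

    public static List<List<String>> groupByHostIp(List<String> urls)
            throws Exception {
        Map<String, List<String>> byIp = new HashMap<String, List<String>>();
        for (String u : urls) {
            String host = new URL(u).getHost();
            // One blocking lookup per URL here; in practice the host->IP map
            // is built up front and in parallel (see the DNS sketch below).
            String ip = InetAddress.getByName(host).getHostAddress();
            List<String> group = byIp.get(ip);
            if (group == null) {
                group = new ArrayList<String>();
                byIp.put(ip, group);
            }
            group.add(u);
        }
        // The secondary "domain" index: one entry per IP address,
        // sorted max -> min by URL count.
        List<List<String>> index = new ArrayList<List<String>>(byIp.values());
        Collections.sort(index, new Comparator<List<String>>() {
            public int compare(List<String> a, List<String> b) {
                return b.size() - a.size();
            }
        });
        return index;
    }
}

With an index like this, each fetcher thread can claim the next group
and work straight through it, which is also what makes the keep-alive
optimization in (c) pay off.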
In order to quickly map a large number of host names to IP addresses,
we had to write some code that fires up a pool of threads. These
resolve (and effectively cache) all of the domain->IP address
mappings in parallel, thus avoiding a big performance hit from DNS
latency.
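
For what it's worth, the shape of that code is roughly this (again,
an illustrative stand-alone class, not the code we're actually
running): a fixed-size thread pool resolves every hostname
concurrently and drops the results into a shared map.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.*;
import java.util.concurrent.*;

/** Illustrative sketch only: resolve many hostnames in parallel so the
 *  per-lookup DNS latencies overlap instead of adding up serially. */
public class ParallelDnsResolver {

    public static Map<String, String> resolveAll(Collection<String> hosts,
            int numThreads) throws InterruptedException {
        final ConcurrentMap<String, String> hostToIp =
            new ConcurrentHashMap<String, String>();
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        for (final String host : hosts) {
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        // Each worker does one blocking DNS lookup; because many
                        // run at once, the total wall-clock time is set by the
                        // slowest lookups rather than the sum of all of them.
                        hostToIp.put(host,
                            InetAddress.getByName(host).getHostAddress());
                    } catch (UnknownHostException e) {
                        // Unresolvable hosts are simply skipped in this sketch.
                    }
                }
            });
        }
        pool.shutdown();
        // Arbitrary upper bound for the sketch; tune to taste.
        pool.awaitTermination(10, TimeUnit.MINUTES);
        return hostToIp;
    }
}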
Two issues we're dealing with now are:
1. Some sites are trickling back data after our fetcher has decided
that it's downloaded everything it needs. This causes a long delay in
the fetch round, as we wait for the thread to terminate. I don't
think this is specific to doing a domain-limited crawl, but I thought
I'd mention it.
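
By way of illustration only (this is a generic plain-Java sketch, not
how the Nutch protocol plugins are actually structured), the kind of
guard that bounds a trickling fetch is a connect/read timeout plus a
hard cap on total fetch time:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/** Illustrative sketch only: connect/read timeouts keep a dead server
 *  from hanging the thread, and a total-time cap keeps a trickling
 *  server from dragging out the end of a fetch round. */
public class TimeoutFetchSketch {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
            (HttpURLConnection) new URL(args[0]).openConnection();
        conn.setConnectTimeout(10 * 1000);  // give up if connecting takes > 10s
        conn.setReadTimeout(30 * 1000);     // give up if no bytes arrive for 30s
        long deadline = System.currentTimeMillis() + 5 * 60 * 1000;  // 5 min cap
        InputStream in = conn.getInputStream();
        byte[] buf = new byte[8192];
        while (in.read(buf) != -1) {
            if (System.currentTimeMillis() > deadline) {
                break;  // the server is trickling; abandon the rest of the body
            }
        }
        in.close();
    }
}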
2. The protocol plugin API doesn't currently pass back the information
the fetcher needs to decide how many URLs in a row it can politely
download from the same host, so we're having to do a bit of a hack to
make that work.
-- Ken
PS - One optimization would be to alter URL weights to avoid having
any one domain with a significantly higher percentage of URLs than
any other domain, but so far that hasn't been an issue for us.
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200