Matthias Jaekle wrote:
Adjusting the amount of downloads dynamically according to the response
time should be great.

But where is the advantage doing this per unique name?

So that we can get a larger portion of the content on sites that do have a lot of capacity. Serializing and delaying are expensive for the fetcher, so we'd like to avoid them when they're not needed. A dynamic delay helps avoid unneeded delays, but does not permit multiple threads. If the "industry standard" is to serialize on hostnames, not IPs, then we needn't punish ourselves by serializing on IPs. If, by serializing on hostnames, multiple threads end up accessing the same IP, then a dynamic delay should make all threads slow down to an inoffensive rate. We want to be as polite as is needed, but no more polite than that. Serializing on IPs needlessly limits our fetching rate for fast sites.
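To make the dynamic-delay idea concrete, here is a rough sketch of a per-hostname delay that follows observed response time. It is not Nutch code: the class name AdaptiveHostDelay, its methods, the smoothing weight of 0.5, and the factor and bounds are all assumptions chosen for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch (not part of Nutch): a politeness delay kept per
 * hostname that grows and shrinks with the observed response time.
 */
final class AdaptiveHostDelay {
    // Smoothed response time in milliseconds, keyed by hostname (not IP).
    private final Map<String, Double> smoothedMillis = new ConcurrentHashMap<>();
    private final double factor;        // delay = factor * smoothed response time
    private final long minDelayMillis;  // never wait less than this
    private final long maxDelayMillis;  // never wait more than this

    AdaptiveHostDelay(double factor, long minDelayMillis, long maxDelayMillis) {
        this.factor = factor;
        this.minDelayMillis = minDelayMillis;
        this.maxDelayMillis = maxDelayMillis;
    }

    /** Record one fetch; the average weights old and new samples equally. */
    void recordResponse(String host, long responseMillis) {
        smoothedMillis.merge(host, (double) responseMillis,
                (old, latest) -> 0.5 * old + 0.5 * latest);
    }

    /** Delay to wait before the next request to this hostname. */
    long delayFor(String host) {
        Double avg = smoothedMillis.get(host);
        if (avg == null) {
            return maxDelayMillis; // unknown host: start out conservative
        }
        long delay = Math.round(factor * avg);
        return Math.max(minDelayMillis, Math.min(maxDelayMillis, delay));
    }
}
```

If several hostnames share one IP and that server starts to struggle, each hostname's recorded response times rise, so each thread's delay rises with them, without the fetcher ever having to resolve or compare IPs.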

Also, note that Nutch currently partitions fetchlists by hostname, not IP, so that multiple fetchers may already access the same IP. So we are not currently consistent.

Doug
