Matthias Jaekle wrote:
> Adjusting the amount of downloads dynamically according to the response
> time should be great.
> But where is the advantage doing this per unique name?
So that we can get a larger portion of the content on sites that do have
a lot of capacity. Serializing and delaying are expensive for the
fetcher, so we'd like to avoid these when they're not needed. A dynamic
delay helps avoid unneeded delays, but does not permit multiple threads.
If the "industry standard" is to serialize on hostnames, not IPs, then
we needn't punish ourselves by serializing on IPs. If by serializing on
hostnames, multiple threads end up accessing the same IP, then a dynamic
delay should make all threads slow down to an inoffensive rate. We want
to be as polite as is needed, but no more polite than that. Serializing on IPs
needlessly limits our fetching rate for fast sites.
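
A rough sketch of the idea above, not Nutch's actual fetcher code: requests
are keyed by hostname rather than IP, and each hostname's delay is scaled to
the last observed response time. The class name, method names, and the
min_delay, max_delay and factor parameters are all made up for illustration,
and locking for real multi-threaded use is left out.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class HostState:
    next_allowed: float = 0.0  # earliest time the next request may start
    delay: float = 1.0         # current politeness delay, in seconds


class HostnamePoliteness:
    """One fetch slot per hostname, with a delay scaled to response time."""

    def __init__(self, min_delay=1.0, max_delay=30.0, factor=2.0):
        # All three parameters are hypothetical tuning knobs.
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.factor = factor
        self.hosts = {}

    def key(self, url):
        # Serialize on the hostname, not on the resolved IP address.
        return urlsplit(url).hostname

    def _state(self, url):
        return self.hosts.setdefault(self.key(url), HostState(delay=self.min_delay))

    def wait_time(self, url, now):
        """Seconds the caller should still wait before fetching url."""
        return max(0.0, self._state(url).next_allowed - now)

    def record(self, url, finished_at, response_time):
        """Adjust the host's delay after a fetch that took response_time seconds."""
        state = self._state(url)
        # Slow responses lengthen the delay, fast responses shorten it,
        # always clamped to the range [min_delay, max_delay].
        scaled = self.factor * response_time
        state.delay = min(self.max_delay, max(self.min_delay, scaled))
        state.next_allowed = finished_at + state.delay
```

If two hostnames happen to share one IP and that server starts to struggle,
both hostnames should see longer response times, so both delays grow without
the fetcher ever resolving the IP. A fast site with many hostnames keeps one
slot per hostname at the minimum delay.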
Also, note that Nutch currently partitions fetchlists by hostname, not
IP, so that multiple fetchers may already access the same IP. So we are
not currently consistent.
Doug