Matthias Jaekle wrote:
All the big search engines seem to treat each subdomain as a separate
host, and they load each subdomain as if it were a separate host.
So most other crawlers use the hostname, not the IP. That's good to
know. If we did make this change, we would not be less polite than
others.
In our case that's bad. We could not handle one simultaneous request
per subdomain. So, depending on system load, we have to answer many
search engines with 503. It is especially bad when all the big search
engines crawl our subdomains at the same time. So using IP addresses to
limit access to one host is much better.
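Roughly, limiting by IP would mean resolving each hostname and keying the
per-host fetch queue on the resolved address, so that subdomains sharing a
server also share one queue. A minimal sketch, with hypothetical names
rather than the actual fetcher code:

  import java.net.InetAddress;
  import java.net.URL;

  public class PolitenessKey {

    /** Key used to serialize requests to the "same host". */
    public static String keyFor(String url, boolean byIp) {
      try {
        String host = new URL(url).getHost();
        if (!byIp) {
          return host.toLowerCase();  // current behaviour: one queue per hostname
        }
        // Proposed behaviour: one queue per IP, so foo.example.com and
        // bar.example.com on the same server are never hit concurrently.
        return InetAddress.getByName(host).getHostAddress();
      } catch (Exception e) {
        // Unresolvable or malformed URLs get a queue of their own.
        return url;
      }
    }

    public static void main(String[] args) {
      // Both calls print the same key if the subdomains share a server.
      System.out.println(keyFor("http://foo.example.com/page", true));
      System.out.println(keyFor("http://bar.example.com/page", true));
    }
  }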
Better for your site, yes, but for some sites that have the capacity,
it would be better to be more aggressive than we are now, so that we can
crawl more of the site. I've recently been running dmoz-seeded crawls
of the whole web, and I find it hard to use much bandwidth without getting
a lot of "http.max.delays exceeded" errors, meaning I'm simply unable to
fetch much of many popular sites.
Perhaps a dynamic property would help. If the elapsed time of the
previous request is less than some fraction of the delay, then we might
lessen the delay. Similarly, if it is greater, or if we get 503s, then
we might increase it. For example, if the fraction were 0.5 and the
delay were 2 seconds, then sites that respond in less than a second
would get their delay decreased, and sites that respond in more than a
second or that return a 503 would have their delay increased. Do you
think this would be effective for your site?
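In rough code, the idea might look like the following; the class name,
the 0.5 fraction, and the min/max bounds are only placeholders for
illustration, not existing Nutch properties:

  public class AdaptiveDelay {

    private static final double FRACTION = 0.5;     // tuning knob from the example above
    private static final long MIN_DELAY_MS = 500;    // assumed lower bound
    private static final long MAX_DELAY_MS = 30000;  // assumed upper bound

    private long delayMs = 2000;                     // start at 2 seconds, as in the example

    /** Adjust the per-host delay after each response. */
    public void update(long elapsedMs, int httpStatus) {
      if (httpStatus == 503 || elapsedMs > FRACTION * delayMs) {
        delayMs = Math.min(MAX_DELAY_MS, delayMs * 2);             // back off
      } else {
        delayMs = Math.max(MIN_DELAY_MS, (long) (delayMs * 0.75)); // speed up
      }
    }

    public long getDelayMs() {
      return delayMs;
    }

    public static void main(String[] args) {
      AdaptiveDelay d = new AdaptiveDelay();
      d.update(800, 200);   // responded in 0.8 s, under half of 2 s: delay shrinks to 1500 ms
      System.out.println(d.getDelayMs());
      d.update(1500, 503);  // 503 (or a slow response): delay doubles to 3000 ms
      System.out.println(d.getDelayMs());
    }
  }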
Doug