Matthias Jaekle wrote:
All the big search engines seem to treat each subdomain as a separate host, and so they put load on each subdomain as if it were its own host.

So most other crawlers use the hostname, not the IP. That's good to know. So if we did make this change, we would not be less polite than others.

In our case that's bad. We cannot handle one simultaneous request for every subdomain, so depending on system load we have to answer many of the search engines with 503. It is especially bad when all the big search engines crawl our subdomains at the same time.

So using IP addresses to limit access to one host is much better.

Better for your site, yes, but for some sites that have the capacity, it would be better to be more aggressive than we are now, so that we can crawl more of the site. I've recently been running DMOZ-seeded crawls of the whole web, and I find it hard to use much bandwidth without getting a lot of "http.max.delays exceeded" errors, meaning I'm simply unable to fetch much of many popular sites.

Perhaps a dynamic property would help. If the elapsed time of the previous request is less than some fraction of the delay, then we might decrease the delay. Similarly, if it is greater, or if we get 503s, then we might increase it. For example, if the fraction were 0.5 and the delay were 2 seconds, then sites that respond in under a second would have their delay decreased, and sites that respond in more than a second, or that return 503, would have their delay increased. Do you think this would be effective for your site?
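
As a rough illustration of what such a dynamic property might look like, here is a minimal sketch in Java. The class name, the shrink/grow factors, and the min/max bounds are all assumptions made up for the example, not anything that exists in Nutch; only the fraction-of-delay test and the 503 backoff follow the rule described above.

    // Hypothetical sketch of an adaptive per-host fetch delay.
    // Shorten the delay when responses come back in well under the
    // current delay; lengthen it when they are slow or the server
    // answers 503. Factors and bounds below are arbitrary choices.
    public class AdaptiveDelay {

        private static final double FRACTION = 0.5;      // assumed threshold fraction
        private static final long MIN_DELAY_MS = 500;    // assumed floor
        private static final long MAX_DELAY_MS = 30_000; // assumed ceiling

        private long delayMs;

        public AdaptiveDelay(long initialDelayMs) {
            this.delayMs = initialDelayMs;
        }

        /** Adjust the delay after a fetch, given elapsed time and HTTP status. */
        public void update(long elapsedMs, int httpStatus) {
            long threshold = (long) (delayMs * FRACTION);
            if (httpStatus == 503 || elapsedMs > threshold) {
                // Slow response or 503: the server is struggling, back off.
                delayMs = Math.min(MAX_DELAY_MS, delayMs * 2);
            } else if (elapsedMs < threshold) {
                // Fast response: the server has headroom, speed up a little.
                delayMs = Math.max(MIN_DELAY_MS, (long) (delayMs * 0.75));
            }
            // Otherwise leave the delay unchanged.
        }

        public long getDelayMs() {
            return delayMs;
        }

        public static void main(String[] args) {
            AdaptiveDelay d = new AdaptiveDelay(2_000);
            d.update(800, 200);   // responded in under 1s with a 2s delay: delay shrinks
            System.out.println(d.getDelayMs());
            d.update(1_500, 503); // 503: delay grows
            System.out.println(d.getDelayMs());
        }
    }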

Doug
