I thought I would post some of my experience with download rates....

We have built a large-scale crawler that has crawled over 2.4 billion URLs and
continues to crawl at upwards of 500 pages/second. In tuning the download policy
we found that both the hit rate and the number of pages downloaded per day come
into play when trying to tread lightly. An easy but delayed measure of whether
you are treading lightly is to monitor sites such as www.webmasterworld.com. A
more direct measure is the volume and type of complaints that come in over email.


From our experience, the bulk of the complaints come from webmasters/businesses/etc.
who purchased 1-5 GB of traffic per month but have a site consisting of thousands,
if not tens of thousands, of pages. We were quick to find out that there are *many*
of these folks out on the Internet. The problem is obvious: if the crawler downloads
the whole site in one shot (even with a 30-second delay), the aggregate bandwidth
usage sometimes puts that entity over their allotted limit, causing their ISP to
charge them extra. Guess who's to blame in that circumstance? Although we have
always adhered to a 30-second policy, which I believe is very conservative in 2004,
we still receive the you-are-hitting-our-site-too-hard type of complaints. Usually
these arise when we hit too many 404s and the webmaster has decided to have their
web server email them every time one is encountered.
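One way to cut down on both kinds of complaints is to give each host a byte
budget and a 404 tolerance on top of the delay. Something along these lines
(again, the figures and names are illustrative, not what we actually run):

    # Claim only a small share of a host's monthly transfer allotment,
    # and back off a host once its dead links start piling up.
    MONTHLY_BYTE_BUDGET = 100 * 1024 * 1024   # ~100 MB/month per host (illustrative)
    MAX_404S_PER_HOST = 50                    # tolerated 404s before backing off

    class HostBudget:
        def __init__(self):
            self.bytes_this_month = 0
            self.not_found_count = 0

        def over_budget(self):
            return (self.bytes_this_month >= MONTHLY_BYTE_BUDGET
                    or self.not_found_count >= MAX_404S_PER_HOST)

        def record_response(self, status, body_bytes):
            self.bytes_this_month += body_bytes
            if status == 404:
                self.not_found_count += 1

    budgets = {}

    def after_fetch(host, status, body):
        """Update the host's budget and decide whether to keep crawling it."""
        b = budgets.setdefault(host, HostBudget())
        b.record_response(status, len(body))
        if b.over_budget():
            # Park the host until the budget window resets (reset logic omitted).
            return False
        return True

When a host goes over budget, park it until the next window rather than keep
hammering it; that is cheaper than answering the angry email.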


Just thought I'd pass along some information from the trenches....

--
Christian Storm, Ph.D.
www.turnitin.com

