You can ask (polite) bots to throttle their request rates and
simultaneous requests. I think you'd probably be quite interested
in the Crawl-delay directive:

http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive

This is respected by at least MSN and Yahoo. Unfortunately, it's
unclear whether Google respects it; they propose this alternative:

http://www.google.com/support/webmasters/bin/answer.py?answer=48620

Of course, if you're being scraped by a bot that doesn't respect this
directive, or by a more malicious scraper, it won't help you at all.
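
For reference, a minimal robots.txt using the directive might look like
this (the 10-second value is just an illustrative choice, not a
recommendation):

```
# robots.txt -- ask compliant crawlers to wait 10s between requests
User-agent: *
Crawl-delay: 10
```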

-JohnF


> -----Original Message-----
> From: Wout Mertens [mailto:[email protected]] 
> Sent: November 16, 2009 9:19 AM
> To: John Lauro
> Cc: [email protected]
> Subject: Re: Preventing bots from starving other users?
> 
> On Nov 16, 2009, at 2:43 PM, John Lauro wrote:
> 
> > Oops, my bad...  It's actually tc and not iptables.  Google "tc
> > qdisc" for some info.
> > 
> > You could allow your local IPs to go unrestricted, and throttle all
> > other IPs to 512kb/sec, for example.
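
For what it's worth, a rough sketch of that tc approach could look like
the following (eth0, the 512kbit ceiling, and the 192.168.0.0/24 local
subnet are all assumptions; untested):

```
# Root htb qdisc on the outbound interface (assumed eth0); unmatched
# traffic falls into class 1:20
tc qdisc add dev eth0 root handle 1: htb default 20

# Class 1:10 -- effectively unrestricted, for local traffic
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit

# Class 1:20 -- everyone else, capped at 512kbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 512kbit ceil 512kbit

# Steer the local subnet (assumed 192.168.0.0/24) into the fast class
tc filter add dev eth0 protocol ip parent 1: prio 1 u32 \
    match ip dst 192.168.0.0/24 flowid 1:10
```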
> 
> Hmmm... The problem isn't the data rate, it's the work 
> associated with incoming requests. As soon as a 500 byte 
> request hits, the web server has to do a lot of work. 
> 
> > What software is the wiki running on?  I assume it's not running
> > under Apache, or there would be some ways to tune Apache.  As others
> > have mentioned, telling the crawlers to behave themselves, or to
> > totally ignore the wiki with a robots file, is probably best.
> 
> Well, the web server is Apache, but surprisingly Apache doesn't 
> allow tuning for this particular case. Suppose normal request 
> traffic looks like this (each A is a user request):
> 
> Time ->
> 
> A  A   AA  A    A   AAA  A    AA A
> 
> With the bot this becomes
> 
> ABBBBBBBBBB A BBBBA BBA BBBBBA AABBBBBB
> 
> So you can see that normal users are just swamped out of 
> "slots". The webserver can render about 9 pages at the same 
> time without impact, but each page takes a second or more to render. 
> At first I set MaxClients to 9, which makes it so the web 
> server doesn't swap to death, but if the bots have 8 requests 
> queued up, and then another 8, and another 8, regular users 
> have no chance of decent interactivity...
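
One way to attack the slot-starvation Wout describes is to cap the
number of worker slots any single client IP can hold, e.g. with mod_qos
(a sketch, assuming the module is installed; the per-IP limit of 2 is
an assumed value):

```
# Prefork tuning as described: ~9 workers render without swapping
MaxClients 9

# mod_qos: no single IP may occupy more than 2 of those slots at once,
# so a greedy bot can't queue up all the workers
QS_SrvMaxConnPerIP 2
```

That way even a bot firing 8 concurrent requests leaves 7 slots free
for regular users, at the cost of serializing the bot's crawl.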
> 
> This may be a corner case due to slow serving, because I'm 
> having a hard time finding a way to throttle the bots. I 
> suppose that normally you'd just add servers...
> 
> Wout.
> 
