Re: Preventing bots from starving other users?

2009-11-16 Thread Wout Mertens
Hi John,

On Nov 15, 2009, at 8:29 PM, John Lauro wrote:

 I would probably do that sort of throttling at the OS level with iptables,
 etc...

Hmmm... How? I don't want to throw away the requests, just queue them. Looking 
at iptables rate limiting, it seems that you can only drop requests.

Then again:

 That said, before that I would investigate why the wiki is so slow...
 Something probably isn't configured right if it chokes with only a few
 simultaneous accesses.  I mean, unless it's an embedded server with under 32MB
 of RAM, the hardware should be able to handle that...

Yeah, it's running pretty old software on a pretty old server. It should be 
upgraded but that is a fair bit of work; I was hoping that a bit of 
configuration could make the situation fair again...

Thanks,

Wout.


Re: Preventing bots from starving other users?

2009-11-16 Thread Karsten Elfenbein
Just create an additional backend and assign the bots to it.
You can set queues and max connections there as needed.
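
A rough sketch of what that could look like (the User-Agent patterns, server 
address and limits below are only assumptions):

frontend wiki
    bind :80
    # crude bot detection on the User-Agent header; adjust the patterns
    acl is_bot hdr_sub(User-Agent) -i bot slurp crawler
    use_backend bots if is_bot
    default_backend users

backend users
    # keep most of the rendering slots for interactive users
    server wiki1 127.0.0.1:8080 maxconn 7

backend bots
    # bots share one slot; excess requests wait in HAProxy's queue
    timeout queue 30s
    server wiki1 127.0.0.1:8080 maxconn 1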

An additional tip: adjust the robots.txt file, as some bots can be slowed down 
that way.
http://www.google.com/support/webmasters/bin/answer.py?answer=48620
Check whether the bots that are crawling are of any real use to you; otherwise 
just exclude them in your robots.txt or block them outright.
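
For example (the bot name and path below are only placeholders), something like 
this keeps a useless crawler out completely and all others away from the 
expensive script URLs:

User-agent: UselessBot
Disallow: /

User-agent: *
Disallow: /cgi-bin/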

For a basic MySQL + MediaWiki setup, one thing to check is whether the MySQL 
query cache is actually working.
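
A quick way to check from the mysql client (the Qcache counters only move if 
the cache is enabled and given some memory):

SHOW VARIABLES LIKE 'query_cache%';
SHOW STATUS LIKE 'Qcache%';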

Karsten

On Sunday, 15 November 2009, you wrote:
 Hi there,
 
 I was wondering if HAProxy helps in the following situation:
 
 - We have a wiki site which is quite slow
 - Regular users don't have many problems
 - We also get crawled by a search bot, which creates many concurrent
  connections, more than the hardware can handle
 - Therefore, service is degraded and users usually have their browsers time
  out on them
 
 Given that we can't make the wiki faster, I was thinking that we could
  solve this by having a per-source-IP queue, which made sure that a given
  source IP cannot have more than e.g. 3 requests active at the same time.
  Requests beyond that would get queued.
 
 Is this possible?
 
 Thanks,
 
 Wout.
 


-- 
Kind regards

Karsten Elfenbein
Development and System Administration

erento - the online marketplace for rental items.

erento GmbH
Friedenstrasse 91
D-10249 Berlin

Tel: +49 (30) 2000 42064
Fax: +49 (30) 2000  8499
eMail:   karsten.elfenb...@erento.com




RE: Preventing bots from starving other users?

2009-11-16 Thread John Lauro
Oops, my bad...  It's actually tc and not iptables.  Google "tc qdisc"
for some info.

You could allow your local IPs to go unrestricted, and throttle all other IPs
to 512kb/sec for example.
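
For example, a minimal HTB sketch (the interface name, local subnet and rates 
are assumptions; note that tc's "kbps" unit means kilobytes per second):

tc qdisc add dev eth0 root handle 1: htb default 20
# fast class for the local network
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit
# everything else is capped
tc class add dev eth0 parent 1: classid 1:20 htb rate 512kbps ceil 512kbps
# put replies going to the local subnet into the fast class
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 match ip dst 192.168.0.0/24 flowid 1:10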

What software is the wiki running on?  I assume it's not running under Apache, or
there would be some ways to tune Apache.  As others have mentioned, telling
the crawlers to behave themselves, or having them ignore the wiki entirely with a
robots file, is probably best.

 -Original Message-
 From: Wout Mertens [mailto:wout.mert...@gmail.com]
 Sent: Monday, November 16, 2009 7:31 AM
 To: John Lauro
 Cc: haproxy@formilux.org
 Subject: Re: Preventing bots from starving other users?
 
 Hi John,
 
 On Nov 15, 2009, at 8:29 PM, John Lauro wrote:
 
  I would probably do that sort of throttling at the OS level with iptables,
  etc...
 
 Hmmm... How? I don't want to throw away the requests, just queue them.
 Looking at iptables rate limiting, it seems that you can only drop requests.
 
 Then again:
 
  That said, before that I would investigate why the wiki is so slow...
  Something probably isn't configured right if it chokes with only a few
  simultaneous accesses.  I mean, unless it's an embedded server with under
  32MB of RAM, the hardware should be able to handle that...
 
 Yeah, it's running pretty old software on a pretty old server. It
 should be upgraded but that is a fair bit of work; I was hoping that a
 bit of configuration could make the situation fair again...
 
 Thanks,
 
 Wout.
 




RE: Preventing bots from starving other users?

2009-11-16 Thread John Marrett
You can ask (polite) bots to throttle their request rates and
simultaneous requests. I think you'd probably be quite interested
in the Crawl-delay directive:

http://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive

This is respected by at least MSN and Yahoo. Unfortunately, it looks
like Google may not (or may?) respect it; they propose this alternative:

http://www.google.com/support/webmasters/bin/answer.py?answer=48620
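
For the bots that do honour it, a robots.txt entry along these lines (the
10-second delay is just an example value) asks them to wait between requests:

User-agent: Slurp
Crawl-delay: 10

User-agent: msnbot
Crawl-delay: 10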

Of course, if you're being crawled by a bot that doesn't respect this
directive, or by a more malicious scraper, it won't help you at all.

-JohnF

 

 -Original Message-
 From: Wout Mertens [mailto:wout.mert...@gmail.com] 
 Sent: November 16, 2009 9:19 AM
 To: John Lauro
 Cc: haproxy@formilux.org
 Subject: Re: Preventing bots from starving other users?
 
 On Nov 16, 2009, at 2:43 PM, John Lauro wrote:
 
  Oops, my bad...  It's actually tc and not iptables.  Google "tc qdisc"
  for some info.
  
  You could allow your local IPs to go unrestricted, and throttle all other
  IPs to 512kb/sec for example.
 
 Hmmm... The problem isn't the data rate, it's the work associated with
 incoming requests. As soon as a 500 byte request hits, the web server has to
 do a lot of work.
 
  What software is the wiki running on?  I assume it's not running under
  Apache, or there would be some ways to tune Apache.  As others have
  mentioned, telling the crawlers to behave themselves, or having them
  ignore the wiki entirely with a robots file, is probably best.
 
 Well, the web server is Apache, but surprisingly Apache doesn't allow for
 tuning this particular case. Suppose normal request traffic looks like this
 (A = user requests):
 
 Time ->
 
 A  A   AA  AA   AAA  AAA A
 
 With the bot (B) this becomes:
 
 ABB A A BBA BA AABB
 
 So you can see that normal users are just swamped out of slots. The webserver
 can render about 9 pages at the same time without impact, but it takes a
 second or more to render a page. At first I set MaxClients to 9, which makes
 it so the web server doesn't swap to death, but if the bots have 8 requests
 queued up, and then another 8, and another 8, regular users have no chance of
 decent interactivity...
 
 This may be a corner case due to slow serving, because I'm having a hard time
 finding a way to throttle the bots. I suppose that normally you'd just add
 servers...
 
 Wout.
 



Re: Preventing bots from starving other users?

2009-11-16 Thread German Gutierrez
Perhaps this plugin could be useful; I've never used it, though:

http://twiki.org/cgi-bin/view/Plugins.TWikiCacheAddOn

On Mon, Nov 16, 2009 at 11:46 AM, Wout Mertens wout.mert...@gmail.com wrote:

 On Nov 16, 2009, at 1:47 PM, Karsten Elfenbein wrote:

  Just create an additional backend and assign the bots to it.
  You can set queues and max connections there as needed.

 Yes, you're right - that's probably the best solution. I'll create an extra
 apache process on the same server that will handle the bot subnet. No extra
 hardware needed. Thanks!
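
 In case it helps anyone else, the bot-only Apache would just be a stripped-down
 second instance with very low limits; roughly something like this (port, paths
 and numbers are made up):

 # httpd-bots.conf - second Apache instance reserved for crawler traffic
 Listen 8081
 PidFile /var/run/httpd-bots.pid
 <IfModule prefork.c>
     StartServers  1
     MaxClients    2
     ServerLimit   2
 </IfModule>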

 The wiki in question is TWiki - very flexible but very bad at caching what
 it does. Basically, for each page view the complete interpreter and all
 plugins get loaded.

 Wout.




-- 
Germán Gutiérrez

Infrastructure Team
OLX Inc.
Buenos Aires - Argentina
Phone: 54.11.4775.6696
Mobile: 54.911.5669.6175
Skype: errare_est
Email: germ...@olx.com

Delivering common sense since 1969 Epoch Fail!.

Nature is not amiable; it treats all things impartially. The wise person is not
amiable; he treats all people impartially.


Preventing bots from starving other users?

2009-11-15 Thread Wout Mertens
Hi there,

I was wondering if HAProxy helps in the following situation:

- We have a wiki site which is quite slow
- Regular users don't have many problems
- We also get crawled by a search bot, which creates many concurrent 
connections, more than the hardware can handle
- Therefore, service is degraded and users usually have their browsers time out 
on them

Given that we can't make the wiki faster, I was thinking that we could solve 
this by having a per-source-IP queue, which made sure that a given source IP 
cannot have more than e.g. 3 requests active at the same time. Requests beyond 
that would get queued.

Is this possible?

Thanks,

Wout.


RE: Preventing bots from starving other users?

2009-11-15 Thread John Lauro
I would probably do that sort of throttling at the OS level with iptables,
etc...

That said, before that I would investigate why the wiki is so slow...
Something probably isn't configured right if it chokes with only a few
simultaneous accesses.  I mean, unless it's an embedded server with under 32MB
of RAM, the hardware should be able to handle that...


 -Original Message-
 From: Wout Mertens [mailto:wout.mert...@gmail.com]
 Sent: Sunday, November 15, 2009 9:57 AM
 To: haproxy@formilux.org
 Subject: Preventing bots from starving other users?
 
 Hi there,
 
 I was wondering if HAProxy helps in the following situation:
 
 - We have a wiki site which is quite slow
 - Regular users don't have many problems
 - We also get crawled by a search bot, which creates many concurrent
 connections, more than the hardware can handle
 - Therefore, service is degraded and users usually have their browsers
 time out on them
 
 Given that we can't make the wiki faster, I was thinking that we could
 solve this by having a per-source-IP queue, which made sure that a
 given source IP cannot have more than e.g. 3 requests active at the
 same time. Requests beyond that would get queued.
 
 
 Is this possible?
 
 Thanks,
 
 Wout.
 




Re: Preventing bots from starving other users?

2009-11-15 Thread Aleksandar Lazic

On Sun 15.11.2009 15:57, Wout Mertens wrote:

Hi there,

I was wondering if HAProxy helps in the following situation:

- We have a wiki site which is quite slow
- Regular users don't have many problems
- We also get crawled by a search bot, which creates many concurrent
 connections, more than the hardware can handle
- Therefore, service is degraded and users usually have their browsers
 time out on them

Given that we can't make the wiki faster, I was thinking that we could
solve this by having a per-source-IP queue, which made sure that a
given source IP cannot have more than e.g. 3 requests active at the
same time. Requests beyond that would get queued.

Is this possible?


Maybe with http://haproxy.1wt.eu/download/1.3/doc/configuration.txt

  src <ip_address>
  fe_sess_rate

in the ACL section.

Maybe you get some ideas from this
http://haproxy.1wt.eu/download/1.3/doc/haproxy-en.txt

5) Access lists
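
A rough sketch of how those could be combined (the subnet, rate threshold and
backend below are just assumptions; note that this rejects rather than queues
the excess requests):

frontend wiki
    bind :80
    acl local_net src 192.168.0.0/24
    acl too_fast  fe_sess_rate gt 20
    # send a 403 to non-local clients while the overall session rate is high
    block if too_fast !local_net
    default_backend wiki_servers

backend wiki_servers
    server wiki1 127.0.0.1:8080 maxconn 9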


Hth

Aleks



Re: Preventing bots from starving other users?

2009-11-15 Thread Łukasz Jagiełło
2009/11/15 Wout Mertens wout.mert...@gmail.com:
 I was wondering if HAProxy helps in the following situation:

 - We have a wiki site which is quite slow
 - Regular users don't have many problems
 - We also get crawled by a search bot, which creates many concurrent 
 connections, more than the hardware can handle
 - Therefore, service is degraded and users usually have their browsers time 
 out on them

 Given that we can't make the wiki faster, I was thinking that we could solve 
 this by having a per-source-IP queue, which made sure that a given source IP 
 cannot have more than e.g. 3 requests active at the same time. Requests 
 beyond that would get queued.

 Is this possible?

Guess so. I move traffic from crawlers to a special web backend because they
mostly harvest during my backup window and slow everything down even more.
Adding a request limit should also be easy; just check the docs.

-- 
Łukasz Jagiełło
System Administrator
G-Forces Web Management Polska sp. z o.o. (www.gforces.pl)

Ul. Kruczkowskiego 12, 80-288 Gdańsk
Company entered in the KRS under no. 246596 by decision of the District Court Gdańsk-Północ