There are gobs of abusive searchbots out there. I won't name names. Look at your logs and weigh the hits each one generates against the referrals it actually sends you. And that's before you get to the email harvesters, and the college kids sitting behind an OC3 with nothing better to do with their dorm-room computers than run a recursive wget against your site every ten minutes. Never mind that most of the material never changes once it is up.
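For the worst of those, one blunt option is Apache's own access control, as discussed below. A minimal sketch, assuming the stock mod_setenvif and mod_access directives from Apache 1.3/2.0; the "badbot" pattern, the IP address, and the /mailman/ path are placeholders, not specific recommendations:

    # Tag requests by User-Agent.  "badbot" is a placeholder pattern;
    # the Wget line catches the recursive-wget crowd mentioned above.
    BrowserMatchNoCase "badbot" abusive_client
    BrowserMatchNoCase "^wget"  abusive_client

    <Location "/mailman/">
        Order Allow,Deny
        Allow from all
        # Refuse one specific troublesome address (example IP)
        Deny from 192.0.2.15
        # Refuse anything tagged above
        Deny from env=abusive_client
    </Location>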
On the other hand, watch out for large ISPs that reach the world through a
small pool of IP addresses. Apache can be configured to block on browser
match, which will get rid of a lot of the clients you don't want, and it can
also block specific IP numbers (see the sketch above). Read your logs, look
at your web page stats, figure out who you don't need, and go to it. For the
individual irresponsible users you'll need some sort of rate throttling (a
rough iptables sketch is at the end of this message).

Re Google: in my experience they are VERY responsible about their crawling,
seldom exceeding one hit per couple of minutes on our sites, even counting
all the IP numbers they crawl from simultaneously. They're generally among
our top ten or fifteen referrers, while accounting for less than 0.2% of our
total kbytes over any time horizon. They do a great job.

On Mon, Mar 25, 2002 at 05:39:31PM -0500, Jon Carnes wrote:
> Apache has some nice tools for limiting the number of connections to a
> specific resource. I believe you can throttle on both the requested
> resource and the destination IP. You probably want to cap the number of
> connections from any single IP address at something like 20.
>
> There are also some nice firewall scripts that do the same thing, but more
> drastically: once the number of connections from a single IP reaches a
> defined threshold, the offending IP is denied access for 5 minutes (or
> whatever time period you want).
>
> Good luck. Let us know what you come up with.
>
> ----- Original Message -----
> From: "kellan" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Monday, March 25, 2002 5:05 PM
> Subject: [Mailman-Users] CGI alternative?
>
> > Hi, I'm part of a team that maintains lists.indymedia.org. We have a
> > very large number of lists, a lot of traffic, and seemingly a lot of
> > interest from robots, particularly badly behaved ones.
> >
> > Several times lately the server has started choking and dying, with
> > loads of 70+, in response to some bot hitting all the listinfo pages
> > and firing up dozens and dozens of CGI processes.
> >
> > I was wondering if anyone else has struggled with this issue. We don't
> > want to simply block all robots; Google can be essential for finding
> > old posts on some of the high-traffic lists. So I'm looking for less
> > resource-intensive alternatives to plain CGI.
> >
> > Has anyone set up the Mailman CGIs under FastCGI, or mod_snake, or
> > something similar? How did that go?
> >
> > I'm also considering simply sticking Squid in front of it; any tips on
> > a Mailman-friendly Squid config would be great.
> >
> > Am I missing an obvious solution?
> >
> > Thanks,
> > Kellan
>
> ------------------------------------------------------
> Mailman-Users mailing list
> [EMAIL PROTECTED]
> http://mail.python.org/mailman/listinfo/mailman-users
> Mailman FAQ: http://www.python.org/cgi-bin/faqw-mm.py

-- 
-----------------------------------------------------------------
 Dan Wilder <[EMAIL PROTECTED]>        Technical Manager & Editor
 SSC, Inc. P.O. Box 55549              Phone: 206-782-8808
 Seattle, WA 98155-0549                URL http://embedded.linuxjournal.com/
-----------------------------------------------------------------
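For the firewall side of what Jon describes, here's a rough iptables sketch,
assuming a kernel and iptables build that include the connlimit and recent
match extensions; the port, limits, and 300-second window are arbitrary
examples, not tuned values:

    # Cap simultaneous HTTP connections from any one source address at 20
    iptables -A INPUT -p tcp --syn --dport 80 \
        -m connlimit --connlimit-above 20 -j DROP

    # Record each new HTTP connection, and drop new connections from any
    # source that has opened 20 or more of them within the last 300 seconds
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --name HTTP --set
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --name HTTP --update --seconds 300 --hitcount 20 -j DROP

The connlimit rule counts connections currently open, while the recent rules
penalize bursts over time, which is closer to the five-minute ban Jon
describes.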
