At 03:29 PM 11/10/00 +0100, Marko van der Puil wrote:
>What we could do as a community is create spiderlawenforcement.org,
>a centralized database where we keep track of spiders and how they
>index our sites.

It's an issue that comes up weekly, but it hasn't become that much of a
problem yet.  The bad spiders could just change IPs and user agent strings, too.

Yesterday I had 12,000 requests from a spider, but the spider added a slash
to the end of every query string, so over 11,000 of them were invalid
requests -- yet the Apache log showed them all as 200s (only the
application knew they were bad requests).
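
(The only reason those showed up as 200s is that the application still
returned a normal page.  Something along these lines -- just a rough,
untested mod_perl 1.x sketch, where the package name and the
trailing-slash check are purely illustrative -- would at least push a
404 into the access log where a scanner could see it:

    package My::Content;
    use strict;
    use Apache::Constants qw(OK NOT_FOUND);

    sub handler {
        my $r = shift;

        # The spider tacked a slash onto the end of every query string;
        # refuse those outright so the access log records a 404 instead
        # of a 200.  (The check here is purely illustrative.)
        my $args = $r->args || '';
        return NOT_FOUND if $args =~ m{/$};

        $r->send_http_header('text/html');
        $r->print("... normal page generation here ...");
        return OK;
    }

    1;
)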

At this point, I'd just like to figure out how to detect them
programmatically.  They're easy to spot when a human looks through the
logs, but much harder for a program to catch.  Some spiders fake the user
agent, too.

It probably makes sense to run a cron job every few minutes to scan the
logs and write out a file of bad IP numbers, then have mod_perl reread the
list of blocked IPs every 100 requests or so.  I could look for IPs making
lots of requests with a really high ratio of bad requests to good ones.
But I'm sure it wouldn't be long before an AOL proxy got blocked.
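
Something like this is what I'm picturing -- just a rough, untested
sketch; the log path, thresholds, file names, and the status-based "bad
request" test are all placeholders, and it assumes the bad requests
actually show up as non-200s in the log:

    #!/usr/bin/perl -w
    # Cron job: scan the access log, flag IPs that make lots of
    # requests with a high ratio of bad (non-200) responses, and
    # write them out to a blocklist file.
    use strict;

    my (%total, %bad);

    open LOG, '/usr/local/apache/logs/access_log' or die "access_log: $!";
    while (<LOG>) {
        # Common Log Format: ip - - [date] "request" status bytes
        my ($ip, $status) =
            m/^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})/ or next;
        $total{$ip}++;
        $bad{$ip}++ if $status >= 400;
    }
    close LOG;

    open OUT, '>/usr/local/apache/conf/blocked_ips.tmp' or die "blocklist: $!";
    for my $ip (keys %total) {
        next unless $total{$ip} > 500;                 # lots of requests...
        next unless $bad{$ip} / $total{$ip} > 0.8;     # ...and mostly bad ones
        print OUT "$ip\n";
    }
    close OUT;

    # Rename so running handlers never see a half-written file.
    rename '/usr/local/apache/conf/blocked_ips.tmp',
           '/usr/local/apache/conf/blocked_ips' or die "rename: $!";

And then an access handler that rereads the blocklist every 100 requests
per child rather than on every hit (again, just a sketch):

    package My::BlockIP;
    # httpd.conf:  PerlAccessHandler My::BlockIP
    use strict;
    use Apache::Constants qw(OK FORBIDDEN);

    my %blocked;
    my $hits = 0;

    sub handler {
        my $r = shift;

        # Reread the blocklist every 100 requests (per child) instead
        # of hitting the file on every request.
        if ($hits++ % 100 == 0) {
            %blocked = ();
            if (open IPS, '/usr/local/apache/conf/blocked_ips') {
                chomp(my @ips = <IPS>);
                @blocked{@ips} = ();
                close IPS;
            }
        }

        return FORBIDDEN if exists $blocked{ $r->connection->remote_ip };
        return OK;
    }

    1;

The per-child counter means each child rereads the file on its own
schedule, which is sloppy but cheap.  The thresholds are the part I don't
trust -- that's exactly where the AOL proxy gets caught.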

Again, the hard part is finding a good way to detect them...

And in my experience blocking doesn't always mean the requests from that
spider stop coming ;)




Bill Moseley
mailto:[EMAIL PROTECTED]
