At 03:29 PM 11/10/00 +0100, Marko van der Puil wrote:
>What we could do as a community is create spiderlawenforcement.org,
>a centralized database where we keep track of spiders and how they
>index our sites.
It's an issue that comes up weekly, but it hasn't become that much of a
problem yet. The bad spiders could just change IPs and user agent strings, too.
Yesterday I had 12,000 requests from a spider, but the spider added a slash
to the end of every query string, so over 11,000 were invalid requests --
yet the Apache log showed them all as 200s (only the application knew they
were bad requests).
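Just to illustrate the kind of thing I mean: a quick sketch (in Python, and
with a made-up log format and the trailing-slash rule assumed, since only
this particular app treats those as invalid) that flags requests whose query
string ends in a slash, even though Apache logged them as 200:

```python
import re

# Rough sketch, not production code. Parses a common-log-format line
# and flags requests with a query string ending in "/" -- the pattern
# that spider produced. The log format and the trailing-slash test are
# assumptions for illustration only.
LOG_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3})'
)

def suspicious(line):
    m = LOG_RE.match(line)
    if not m:
        return None
    ip, path, status = m.groups()
    if "?" in path and path.endswith("/"):
        return ip  # query string with a bogus trailing slash
    return None

line = ('1.2.3.4 - - [10/Nov/2000:15:29:00 +0100] '
        '"GET /search?q=foo/ HTTP/1.0" 200 1234')
print(suspicious(line))  # -> 1.2.3.4
```

Of course this only works because we happen to know what this one spider's
broken requests look like.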
At this point, I'd just like to figure out how to detect them
programmatically. It seems easy to spot them as a human looking through
the logs, but less so with a program. Some spiders fake the user agent.
It probably makes sense to run a cron job every few minutes to scan the
logs and write out a file of bad IP numbers, then have mod_perl re-read
that list of IPs to block every 100 requests or so. I could look for lots
of requests from the same IP with a really high ratio of bad requests to
good ones. But I'm sure it wouldn't be long before an AOL proxy got blocked.
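The cron-job idea might look something like this sketch (again Python, and
the log format, thresholds, and the assumption that the *application* logs
each request as good or bad are all made up for illustration -- Apache alone
can't tell, as noted above). The minimum-request threshold is one crude way
to avoid blocking a busy proxy:

```python
from collections import defaultdict

# Hypothetical sketch of the periodic log scan. Input lines look like
# "IP GOOD" or "IP BAD" (written by the application, which is the only
# thing that knows which requests were bad). Emits IPs that are both
# high-volume and overwhelmingly bad. Both thresholds are guesses.
MIN_REQUESTS = 500    # ignore low-volume IPs, so a shared proxy survives
MAX_BAD_RATIO = 0.9   # block only if more than 90% of requests were bad

def bad_ips(lines):
    counts = defaultdict(lambda: [0, 0])   # ip -> [good, bad]
    for line in lines:
        ip, verdict = line.split()
        counts[ip][verdict == "BAD"] += 1  # True indexes slot 1 (bad)
    blocked = []
    for ip, (good, bad) in counts.items():
        total = good + bad
        if total >= MIN_REQUESTS and bad / total > MAX_BAD_RATIO:
            blocked.append(ip)
    return blocked

log = (["10.0.0.1 BAD"] * 495 + ["10.0.0.1 GOOD"] * 5
       + ["10.0.0.2 GOOD"] * 600)
print(bad_ips(log))  # -> ['10.0.0.1']
```

The blocked list would then get written to the file that the mod_perl
handler re-reads.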
Again, the hard part is finding a good way to detect them...
And in my experience blocking doesn't always mean the requests from that
spider stop coming ;)
Bill Moseley
mailto:[EMAIL PROTECTED]