I vote for that! It would make my life about 5000 times simpler :) I've also inlined a couple of rough mod_perl sketches below, for the URL sanity check and for the 404-ratio idea.

Marko van der Puil wrote:

> Hi,
>
> I had the same thing; some spiders are programmed VERY sloppily. I had a
> site that responded to ANY request made to its location. The majority of
> spiders don't understand single and double quotes, or HREFs with the quotes
> left out entirely. I also understand that absolute href="/bla" and relative
> href="../bla" links are a problem.
>
> Those spiders would simply start requesting URLs like
>
>   GET /foo/file=1243/date=12-30-2000/name=foobar'/foo/file=1243/date=12-30-2000/name=foobar
>   GET ../bla'
>   GET ../bla/'../bla'../bla'
>
> and so on. Each such request would then generate a page with a load of
> faulty links that would also be followed, because all the HREFs were built
> from the data in the requested URL.
>
> Then other spiders picked up those faulty links from each other, and soon I
> got more traffic from spiders trying to index faulty links than from
> regular visitors. :)
>
> What I did was check the input for a particular URL and see if it was
> correct (should have done that in the first place), then 404ed the
> bastards.... I am now redirecting them to the main page, which looks nicer
> in yer logs too, and the spider might be tempted to spider yer page
> regularly (most spiders drop redirects). You could also just return a
> plain-text OK: lots of nice 200s in yer stats... Another solution I have
> seen is returning a doorway page to your site (search-engine SPAM!). That's
> hitting them back where it hurts. :)
>
> I've raised this with the owners of those spiders (Excite/AltaVista), but I
> have had no satisfactory responses from them.
>
> What we could do as a community is create spiderlawenforcement.org, a
> centralized database where we keep track of spiders and how they index our
> sites. We could index spiders by Agent tag, note which ones follow
> robots.txt and which ones explicitly exploit it, and blacklist some by IP
> if they keep breaking the rules. Lots of developers could use this database
> to block those nasty sons of.... er, well, sons of spiders, I suppose. All
> open-sourced of course, with the data available for free and some Perl
> modules to talk to the db. Send an email to the administrator of the spider
> every time it tries a bad link on a member site, and watch how fast they'll
> fix the bl**dy things!
>
> Let me know if any of you are interested in such a thing.
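For Marko's sanity check, something like this rough, untested sketch is what I have in mind: validate the requested URI against the patterns the site actually generates and bounce everything else to the front page. The module name, URL patterns, and homepage URL here are made up for illustration.

    package My::SpiderCheck;    # hypothetical module name

    use strict;
    use Apache::Constants qw(DECLINED REDIRECT);

    # Example patterns for URLs this site really serves -- replace with your own.
    my @valid = (
        qr{^/$},
        qr{^/index\.html?$},
        qr{^/foo/file=\d+/date=\d{2}-\d{2}-\d{4}/name=\w+$},
    );

    sub handler {
        my $r   = shift;
        my $uri = $r->uri;

        # Well-formed requests continue through the normal phases.
        foreach my $pat (@valid) {
            return DECLINED if $uri =~ $pat;
        }

        # Anything else (stray quotes, doubled paths, unconverted entities)
        # gets sent to the main page instead of piling up 404s.
        $r->header_out(Location => 'http://www.example.com/');
        return REDIRECT;
    }

    1;

Hooked in from httpd.conf with something like:

    PerlTransHandler My::SpiderCheck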
> Bill Moseley wrote:
>
> > This is slightly OT, but any solution I use will be mod_perl, of course.
> >
> > I'm wondering how people deal with spiders. I don't mind being spidered
> > as long as it's a well-behaved spider that follows robots.txt. And at
> > this point I'm not concerned with the load spiders put on the server
> > (and I know there are modules for dealing with load issues).
> >
> > But it's amazing how many are just lame, in that they take perfectly
> > good HREF tags and mess them up in the request. For example, every day
> > I see many requests from Novell's BorderManager where it forgot to
> > convert HTML entities in HREFs before making the request.
> >
> > Here's another example:
> >
> > 64.3.57.99 - "-" [04/Nov/2000:04:36:22 -0800] "GET /../../../ HTTP/1.0" 400
> > 265 "-" "Microsoft Internet Explorer/4.40.426 (Windows 95)" 5740
> >
> > In the last day that IP has requested about 10,000 documents. Over half
> > were 404s: some were non-converted entities from HREFs, but most were
> > for documents that do not and have never existed on this site. Almost
> > 1000 requests were 400s (Bad Request, like the example above). And I'd
> > guess that's not really the correct user agent, either....
> >
> > In general, what I'm interested in stopping are the thousands of
> > requests for documents that just don't exist on the site. And in simply
> > blocking the lame ones, since they are, well, lame.
> >
> > Anyway, what do you do with spiders like this, if anything? Is it even
> > an issue that you deal with?
> >
> > Do you use any automated methods to detect spiders, and perhaps block
> > the lame ones? I wouldn't want to track every IP, but it seems like I
> > could do well just looking at IPs that have a high proportion of 404s
> > to 200s and 304s and have been requesting over a long period of time,
> > or very frequently.
> >
> > The reason I'm asking is that I was asked about all the 404s in the web
> > usage reports. I know I could post-process the logs before running the
> > web reports, but it would be much more fun to use mod_perl to catch and
> > block them on the fly.
> >
> > BTW -- I have blocked spiders on the fly before -- I used to have a
> > decoy in robots.txt that, if followed, would add that IP to the blocked
> > list. It was interesting to see one spider get caught by that trick,
> > because it took thousands and thousands of 403 errors before that
> > spider got a clue that it was blocked on every request.
> >
> > Thanks,
> >
> > Bill Moseley
> > mailto:[EMAIL PROTECTED]
>
> --
> Yours sincerely,
> Met vriendelijke groeten,
>
> Marko van der Puil    http://www.renesse.com
> [EMAIL PROTECTED]
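As for spotting the lame ones automatically, a per-IP tally in the logging phase seems like the natural hook for the 404-ratio check Bill describes. This is only a rough, untested sketch: the module name, thresholds, and DBM path are invented, and a real version would want file locking around the tied hash plus some way to expire old counts.

    package My::SpiderWatch;    # hypothetical module name

    use strict;
    use Apache::Constants qw(OK);
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Invented thresholds -- tune for your own traffic.
    my $MIN_REQUESTS  = 500;
    my $MAX_404_RATIO = 0.5;
    my $DB_PATH       = '/var/tmp/spiderwatch.db';

    sub handler {
        my $r = shift;

        my $ip     = $r->connection->remote_ip;
        my $status = $r->status;

        # Shared per-IP counters; every child updates the same file, so a
        # real version needs flock() around this.
        my %count;
        tie %count, 'DB_File', $DB_PATH, O_RDWR|O_CREAT, 0644
            or return OK;    # never break the request over bookkeeping

        $count{"$ip:all"}++;
        $count{"$ip:404"}++ if $status == 404;

        my $all = $count{"$ip:all"};
        my $bad = $count{"$ip:404"} || 0;
        untie %count;

        # Flag heavy clients whose hit pattern looks spider-broken; blocking
        # (403, or a redirect as in the earlier sketch) could key off this.
        if ($all >= $MIN_REQUESTS && $bad / $all > $MAX_404_RATIO) {
            $r->log_error("possible lame spider: $ip ($bad 404s out of $all requests)");
        }

        return OK;
    }

    1;

Installed with:

    PerlLogHandler My::SpiderWatch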
--
Jimi Thompson
Web Master
L3 communications

"It's the same thing we do every night, Pinky."

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]