I vote for that! It would make my life about 5000 times simpler :) I've also inlined a couple of rough mod_perl sketches below, for the URL sanity check and for the 404-ratio idea.

Marko van der Puil wrote:

> Hi,
>
> I had the same thing; some spiders are programmed VERY sloppily. I had a
> site that responded to ANY request made to its location. The majority of
> spiders don't understand single and double quotes, or HREFs with the quotes
> left out entirely. I also understand that absolute href="/bla" and relative
> href="../bla" links are a problem.
>
> Those spiders would simply start requesting URLs like
>
>   GET /foo/file=1243/date=12-30-2000/name=foobar'/foo/file=1243/date=12-30-2000/name=foobar
>   GET ../bla'
>   GET ../bla/'../bla'../bla'
>
> and so on. Each such request would then generate a page with a load of
> faulty links that would also be followed, because all the HREFs were built
> from the data in the requested URL.
>
> Then other spiders picked up those faulty links from each other, and soon I
> got more traffic from spiders trying to index faulty links than from
> regular visitors. :)
>
> What I did was check the input for a particular URL and see if it was
> correct (should have done that in the first place), then 404ed the
> bastards.... I am now redirecting them to the main page, which looks nicer
> in yer logs too, and the spider might be tempted to spider yer page
> regularly (most spiders drop redirects). You could also just return a
> plain-text OK: lots of nice 200s in yer stats... Another solution I have
> seen is returning a doorway page to your site (search-engine SPAM!). That's
> hitting them back where it hurts. :)
>
> I've raised this with the owners of those spiders (Excite/AltaVista), but I
> have had no satisfactory responses from them.
>
> What we could do as a community is create spiderlawenforcement.org, a
> centralized database where we keep track of spiders and how they index our
> sites. We could index spiders by Agent tag, note which ones follow
> robots.txt and which ones explicitly exploit it, and blacklist some by IP
> if they keep breaking the rules. Lots of developers could use this database
> to block those nasty sons of.... er, well, sons of spiders, I suppose. All
> open-sourced of course, with the data available for free and some Perl
> modules to talk to the db. Send an email to the administrator of the spider
> every time it tries a bad link on a member site, and watch how fast they'll
> fix the bl**dy things!
>
> Let me know if any of you are interested in such a thing.
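For Marko's sanity check, something like this rough, untested sketch is what I have in mind: validate the requested URI against the patterns the site actually generates and bounce everything else to the front page. The module name, URL patterns, and homepage URL here are made up for illustration.

    package My::SpiderCheck;    # hypothetical module name

    use strict;
    use Apache::Constants qw(DECLINED REDIRECT);

    # Example patterns for URLs this site really serves -- replace with your own.
    my @valid = (
        qr{^/$},
        qr{^/index\.html?$},
        qr{^/foo/file=\d+/date=\d{2}-\d{2}-\d{4}/name=\w+$},
    );

    sub handler {
        my $r   = shift;
        my $uri = $r->uri;

        # Well-formed requests continue through the normal phases.
        foreach my $pat (@valid) {
            return DECLINED if $uri =~ $pat;
        }

        # Anything else (stray quotes, doubled paths, unconverted entities)
        # gets sent to the main page instead of piling up 404s.
        $r->header_out(Location => 'http://www.example.com/');
        return REDIRECT;
    }

    1;

Hooked in from httpd.conf with something like:

    PerlTransHandler My::SpiderCheck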
> Bill Moseley wrote:
>
> > This is slightly OT, but any solution I use will be mod_perl, of course.
> >
> > I'm wondering how people deal with spiders. I don't mind being spidered
> > as long as it's a well-behaved spider that follows robots.txt. And at
> > this point I'm not concerned with the load spiders put on the server
> > (and I know there are modules for dealing with load issues).
> >
> > But it's amazing how many are just lame, in that they take perfectly
> > good HREF tags and mess them up in the request. For example, every day
> > I see many requests from Novell's BorderManager where it forgot to
> > convert HTML entities in HREFs before making the request.
> >
> > Here's another example:
> >
> > 64.3.57.99 - "-" [04/Nov/2000:04:36:22 -0800] "GET /../../../ HTTP/1.0" 400
> > 265 "-" "Microsoft Internet Explorer/4.40.426 (Windows 95)" 5740
> >
> > In the last day that IP has requested about 10,000 documents. Over half
> > were 404s: some were non-converted entities from HREFs, but most were
> > for documents that do not and have never existed on this site. Almost
> > 1000 requests were 400s (Bad Request, like the example above). And I'd
> > guess that's not really the correct user agent, either....
> >
> > In general, what I'm interested in stopping are the thousands of
> > requests for documents that just don't exist on the site. And in simply
> > blocking the lame ones, since they are, well, lame.
> >
> > Anyway, what do you do with spiders like this, if anything? Is it even
> > an issue that you deal with?
> >
> > Do you use any automated methods to detect spiders, and perhaps block
> > the lame ones? I wouldn't want to track every IP, but it seems like I
> > could do well just looking at IPs that have a high proportion of 404s
> > to 200s and 304s and have been requesting over a long period of time,
> > or very frequently.
> >
> > The reason I'm asking is that I was asked about all the 404s in the web
> > usage reports. I know I could post-process the logs before running the
> > web reports, but it would be much more fun to use mod_perl to catch and
> > block them on the fly.
> >
> > BTW -- I have blocked spiders on the fly before -- I used to have a
> > decoy in robots.txt that, if followed, would add that IP to the blocked
> > list. It was interesting to see one spider get caught by that trick,
> > because it took thousands and thousands of 403 errors before that
> > spider got a clue that it was blocked on every request.
> >
> > Thanks,
> >
> > Bill Moseley
> > mailto:[EMAIL PROTECTED]
>
> --
> Yours sincerely,
> Met vriendelijke groeten,
>
> Marko van der Puil    http://www.renesse.com
> [EMAIL PROTECTED]
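As for spotting the lame ones automatically, a per-IP tally in the logging phase seems like the natural hook for the 404-ratio check Bill describes. This is only a rough, untested sketch: the module name, thresholds, and DBM path are invented, and a real version would want file locking around the tied hash plus some way to expire old counts.

    package My::SpiderWatch;    # hypothetical module name

    use strict;
    use Apache::Constants qw(OK);
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    # Invented thresholds -- tune for your own traffic.
    my $MIN_REQUESTS  = 500;
    my $MAX_404_RATIO = 0.5;
    my $DB_PATH       = '/var/tmp/spiderwatch.db';

    sub handler {
        my $r = shift;

        my $ip     = $r->connection->remote_ip;
        my $status = $r->status;

        # Shared per-IP counters; every child updates the same file, so a
        # real version needs flock() around this.
        my %count;
        tie %count, 'DB_File', $DB_PATH, O_RDWR|O_CREAT, 0644
            or return OK;    # never break the request over bookkeeping

        $count{"$ip:all"}++;
        $count{"$ip:404"}++ if $status == 404;

        my $all = $count{"$ip:all"};
        my $bad = $count{"$ip:404"} || 0;
        untie %count;

        # Flag heavy clients whose hit pattern looks spider-broken; blocking
        # (403, or a redirect as in the earlier sketch) could key off this.
        if ($all >= $MIN_REQUESTS && $bad / $all > $MAX_404_RATIO) {
            $r->log_error("possible lame spider: $ip ($bad 404s out of $all requests)");
        }

        return OK;
    }

    1;

Installed with:

    PerlLogHandler My::SpiderWatch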
--
Jimi Thompson
Web Master
L3 communications

"It's the same thing we do every night, Pinky."

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]