On 7/25/06, Mike Erdely <[EMAIL PROTECTED]> wrote:
> prad wrote:
> > what is the best way to stop those robots and spiders from getting in?
>
> Someone on this list (who can reveal themselves if they want) has a
> pretty good setup to block "disrespectful" robots.
>
> They have a robots.txt file that specifies a "Disallow: /somedir/".
> Anyone that actually GOES into that directory gets blocked by PF.
>
> It'd be pretty easy to parse your /var/www/logs/access_log for accesses
> of "/somedir/" and have them added to a table.
>
> -ME
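
Something like this would cover the log-watching part. Rough, untested
sketch only: the table name "badbots" is just a placeholder (it has to
exist in your pf.conf), and pfctl needs to run as root.

#!/usr/bin/env python
# Watch the httpd access log and add anyone who requests the trap
# directory to a PF table. Assumes pf.conf already has something like:
#   table <badbots> persist
#   block in quick from <badbots>
import subprocess
import time

LOG = "/var/www/logs/access_log"   # log path from the mail above
TRAP = "/somedir/"                 # the Disallow'd trap directory
TABLE = "badbots"                  # placeholder table name

def block(ip):
    # needs root; adding an address that is already in the table is harmless
    subprocess.call(["pfctl", "-t", TABLE, "-T", "add", ip])

def watch():
    seen = set()
    with open(LOG) as f:
        f.seek(0, 2)                # start at the end, like tail -f
        while True:
            line = f.readline()
            if not line:
                time.sleep(1)
                continue
            parts = line.split()
            if not parts:
                continue
            ip = parts[0]           # common log format: client IP first
            if TRAP in line and ip not in seen:
                seen.add(ip)
                block(ip)

if __name__ == "__main__":
    watch()
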


Arxiv dumps massive amounts of data at you and then blocks you if you
access a special robot-trap page. See
http://arxiv.org/RobotsBeware.html.

If you're running a CGI/template-based or frame-based site, it would not
be difficult to generate a new trap page every day and link it on all
pages for robots to fall into. You could even make the URL sound
plausible by drawing it from a wordbank, so that statistical analysis of
the characters can't pick out the real links from the trap ones. You'd
also have to move the link around (e.g. just having the trap as the last
link on every page is obvious).
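
For what it's worth, a minimal sketch of the daily-rotation part could
look like the following. The wordbank path is made up; the idea is to
call the same function both from whatever builds your pages and from the
log-watching script, so both sides agree on what today's trap URL is.

# Generate a plausible-looking trap URL that changes once a day.
import datetime
import random

WORDS = open("/var/www/conf/wordbank.txt").read().split()

def trap_url(day=None):
    day = day or datetime.date.today()
    rng = random.Random(day.toordinal())  # seeded per day: same URL all day
    return "/" + "-".join(rng.sample(WORDS, 3)) + ".html"
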

However, the above is probably excessive; robot authors really aren't
that diligent (laziness is the whole reason they're running a robot in
the first place).

-Nick
