Stephen Sutherland wrote:
> How do you guys create your web crawler in such a way
> that it would step over bot bait pages like WSPosion?
>
> Do you simply include them in a list of urls to avoid
> ?
If it's a large crawl then this sort of manual involvement is
untenable. Mind you, there might be considerable mileage in
not crawling or at least massively lowering the priority of
URLs matching "*wpoison*".
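
A rough sketch of the priority idea in Python, assuming the crawler keeps a queue where larger numbers are fetched later. The names SUSPECT_PATTERNS, PENALTY and url_priority are made up for illustration, and the penalty value is arbitrary.

```python
import fnmatch

# Hypothetical pattern list and penalty; tune both for your own crawl.
SUSPECT_PATTERNS = ["*wpoison*"]
PENALTY = 1000


def url_priority(url, base_priority=0):
    """Return a queue priority; larger numbers are fetched later."""
    lowered = url.lower()
    # fnmatchcase treats "*" as "any run of characters", slashes included.
    if any(fnmatch.fnmatchcase(lowered, pat) for pat in SUSPECT_PATTERNS):
        return base_priority + PENALTY
    return base_priority
```

Dropping the URL outright instead of penalising it is the same check with a different action.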
In this particular case I'd suggest fetching the page twice
in relatively short succession. wpoison generated pages will
be massively different -- a sure sign that the web site is
fooling with you. And you wouldn't have to fetch every page twice.
Since wpoison conveniently generates a practically infinite URL
space, just pick how often you want to check pages and you'll catch
the shifty pages within N attempts. I've pondered this heuristic
in the general case as a simple method for spotting the variable
part of any web page.
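
The double-fetch check might look something like this in Python. It is only a sketch: `fetch` stands for whatever function your crawler uses to get a page body as text, and the 0.5 threshold and the every-20th-page sampling rate are guesses, not measured values.

```python
import difflib


def looks_generated(fetch, url, threshold=0.5):
    """Fetch url twice and report whether the two bodies differ massively.

    fetch is any callable taking a URL and returning the body as a string
    (an assumption; plug in your crawler's own fetcher).
    """
    first = fetch(url).split()
    second = fetch(url).split()
    # Compare word sequences; autojunk is off so common words still count.
    matcher = difflib.SequenceMatcher(None, first, second, autojunk=False)
    return matcher.ratio() < threshold


def should_double_fetch(pages_fetched, every=20):
    """Sample one page in `every` rather than fetching every page twice."""
    return pages_fetched % every == 0
```

A real crawler would want a short delay between the two fetches and some tolerance for pages that legitimately change (ads, timestamps), which is why the threshold is well below 1.0.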
Another idea is to flag "directories" that have too many entries.
The one wpoison script I looked at appeared to generate pages like this:
.../wpoison/random-word-1
.../wpoison/random-word-2
.../wpoison/random-word-3
and so on. A robot could notice that the ".../wpoison" directory is
very large and therefore should be dropped or at least lowered in
retrieval priority. And, again, we might generally believe that
large directories are a bad sign (e.g., log directories).
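
Counting entries per "directory" could be sketched as below. DirectoryCounter and its max_entries limit are hypothetical; the right cutoff depends on the crawl, and 1000 is just a placeholder.

```python
import posixpath
from urllib.parse import urlsplit


class DirectoryCounter:
    """Track how many distinct entries have been seen under each URL directory."""

    def __init__(self, max_entries=1000):
        self.max_entries = max_entries  # placeholder cutoff, not a tested value
        self.entries = {}

    @staticmethod
    def _split(url):
        parts = urlsplit(url)
        directory = posixpath.dirname(parts.path) or "/"
        return (parts.netloc, directory), posixpath.basename(parts.path)

    def add(self, url):
        """Record a discovered URL under its (host, directory) key."""
        key, name = self._split(url)
        self.entries.setdefault(key, set()).add(name)

    def is_suspicious(self, url):
        """True if the URL's directory already holds more than max_entries names."""
        key, _ = self._split(url)
        return len(self.entries.get(key, ())) > self.max_entries
```

The crawler would call add() for every link it discovers and consult is_suspicious() before queueing, either dropping the URL or lowering its priority.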
You may be able to detect traps based on content, but that seems a
little dodgy.
I'd hope there are more sophisticated spider traps out there. wpoison
could stand many improvements. Even just seeding the random number
generator based on a hash of $PATH_INFO would be a big help. Not to
mention that the default installation is a huge tip-off. I guess if
you're really serious you stick these bogus e-mail addresses into your
normal web pages but as invisible text.
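
To show what I mean by seeding on a hash of $PATH_INFO, here is a sketch from the trap's side, in Python rather than whatever wpoison itself uses. WORDS and page_words are invented for the example. The point is that the same path always yields the same "random" page, so a repeat fetch of one URL no longer gives the game away.

```python
import hashlib
import random

# Stand-in vocabulary; a real trap would use a much larger word list.
WORDS = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]


def page_words(path_info, count=20):
    """Return a word sequence that depends only on path_info."""
    digest = hashlib.sha256(path_info.encode("utf-8")).digest()
    # A private Random instance seeded from the hash keeps output repeatable.
    rng = random.Random(int.from_bytes(digest, "big"))
    return [rng.choice(WORDS) for _ in range(count)]
```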
-- George
--
This message was sent by the Internet robots and spiders discussion list
([EMAIL PROTECTED]). For list server commands, send "help" in the body of a message
to "[EMAIL PROTECTED]".