Christian M. Cepel asks:
| Does John's script obey robot exclusions?  I'm ready to kill Altavista
| for spidering my javascript validated forms, submitting them empty, and
| completely ignoring robot exclusions.

Yes, it does.  The first thing it does for each site is ask  for  the
robots.txt  file, and stays away from directories that have a general
exclusion.  The only exceptions are when someone specifically asks to
have  their  music  scanned,  and  then  their  directory  becomes an
"exception to the exclusion". I think this has only happened once.

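For anyone writing their own searcher, here is a minimal sketch of that
check in Python. It is not the actual script; the function name
allowed(), the example.org URLs and the "exceptions" argument are all
made up for illustration. It uses the standard urllib.robotparser
module and assumes the robots.txt lines have already been fetched.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, url, exceptions=(), agent="*"):
    """Return True if url may be fetched (hypothetical helper)."""
    # Directories whose owners asked to be scanned override the
    # general exclusion: the "exception to the exclusion".
    if any(url.startswith(prefix) for prefix in exceptions):
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)        # lines of an already-fetched robots.txt
    return parser.can_fetch(agent, url)

# Example robots.txt content (assumed, not taken from any real site).
ROBOTS = ["User-agent: *", "Disallow: /music/"]

print(allowed(ROBOTS, "http://example.org/music/tune.abc"))
print(allowed(ROBOTS, "http://example.org/music/tune.abc",
              exceptions=("http://example.org/music/",)))
```
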
I also have a significant tune  collection  (partly  from  extracting
tunes  from  lists  like  this one).  I was given write access to the
robots.txt file on the machine a couple of years ago, and it excludes
most  of  my  music stuff.  I've found that the big search sites just
aren't very good for finding music.  And then I have to list  my  own
directories  as  exceptions  to the robots.txt rules, as mentioned in
the previous paragraph.

OTOH, if I had a collection of abc songs with  lyrics,  I'd  probably
want  that  searched  by  the  big  guys.  They're all pretty good at
finding lyrics.

I know what you mean about the forms.  And there's a similar  problem
with  cgi  scripts.   Maybe  two  years  ago, I started reading about
research into searching for "hidden pages" on the web that  can  only
be  found via forms and scripts.  My reaction to this was "Uh-oh; I'd
better watch for this."  About a year ago they hit.   Several  search
sites   started   invoking   my  lookup  script  systematically  with
random-looking arguments, and when they got  a  reply  with  a  form,
started exploring the links.  They were, in effect, attempting to get
every abc tune on the web in every format that my scripts know how to
return.   One  of  them  hit  our server simultaneously from about 30
different addresses, and had over 100 tune conversions outstanding. It
brought the server to a screeching halt.

I got enough cpu time  to  add  a  "blacklist"  to  my  scripts,  and
whenever  I  see symptoms of this, I add their address (or subnet) to
the blacklist.  And I added a small (5 sec) minimum between  requests
from  the  same  address.   Both  of  these can be a hassle to people
working from behind a firewall, since what  my  scripts  see  is  the
firewall's  address, and all users behind it look like a single user.
But such things are  necessary  when  there  are  misbehaving  search
monsters out there.
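
The two defenses can be sketched roughly as follows. This is only an
illustration under stated assumptions, not the code in the scripts: the
class name RequestGate, its methods and the sample addresses are
hypothetical, and the caller is assumed to pass in the current time in
seconds. Only the 5-second minimum and the "address or subnet"
blacklist come from the description above.

```python
import ipaddress

class RequestGate:
    """Hypothetical gate: blacklist plus a minimum gap per address."""

    def __init__(self, blacklist=(), min_interval=5.0):
        # Entries may be single addresses or subnets such as "192.0.2.0/24".
        self.networks = [ipaddress.ip_network(b, strict=False)
                         for b in blacklist]
        self.min_interval = min_interval
        self.last_seen = {}           # address -> time of last accepted request

    def permit(self, addr, now):
        ip = ipaddress.ip_address(addr)
        if any(ip in net for net in self.networks):
            return False              # blacklisted address or subnet
        last = self.last_seen.get(addr)
        if last is not None and now - last < self.min_interval:
            return False              # too soon after the previous request
        self.last_seen[addr] = now
        return True

gate = RequestGate(blacklist=["192.0.2.0/24"])
print(gate.permit("192.0.2.7", now=0.0))      # inside the blacklisted subnet
print(gate.permit("198.51.100.1", now=0.0))   # first request from this address
print(gate.permit("198.51.100.1", now=3.0))   # only 3 seconds later
```

Keying everything on the visible address is exactly what causes the
firewall problem: every user behind one firewall shares one entry.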

One of the side effects of this is that I no longer tell  the  mailer
here  to  forward my email to my home machine.  I log in and read the
email here.  This means that I'm logged in several times during  most
days.  This is so that I can keep a constant watch for attacks on the
web server.  Most of these are probably not malicious; they are  more
likely from novice searchers.  But it's a good idea to spot them fast
and install defenses against the new ones.

My search program also has a sort of "reverse blacklist". In its list
of starting URLs, I can include URLs or hosts that are to be avoided.
I've mentioned this on lists that I subscribe to, with the idea  that
someone might not want their tunes indexed. So far I haven't actually
had anyone say they want to be avoided, but it's  a  possibility.   I
mostly  use  this  as a way to keep the search program away from some
sites that are known sinkholes of time with no abc tunes.  There  are
some  sites  that  have pages with millions of links, and such things
are best ignored.
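
A sketch of how such an avoid list might be checked, assuming entries
are either full URL prefixes or bare host names. The function name
should_avoid() and the sample entries are invented for the example, and
the matching rules are a guess at one reasonable way to do it.

```python
from urllib.parse import urlsplit

def should_avoid(url, avoid):
    """Return True if url matches an avoid entry (hypothetical helper)."""
    host = (urlsplit(url).hostname or "").lower()
    for entry in avoid:
        if "://" in entry:
            # Entry is a URL: skip everything underneath it.
            if url.startswith(entry):
                return True
        elif host == entry.lower():
            # Entry is a bare host name: skip the whole host.
            return True
    return False

# Made-up entries: one whole host and one directory on another host.
AVOID = ["sinkhole.example.com", "http://example.org/linkfarm/"]

print(should_avoid("http://sinkhole.example.com/page1.html", AVOID))
print(should_avoid("http://example.org/tunes/jig.abc", AVOID))
```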

Another thing I have my searcher do is ignore any URL with "cgi" as a
token, i.e., with non-letters on both sides. This is fairly effective
at preventing the invocation of scripts without arguments, and that's
almost  always  a  pure waste of time.  I've also been thinking of
excluding things like "php", but so far that hasn't been necessary.
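
The "cgi as a token" test is easy to express as a regular expression.
The sketch below is one possible reading of the rule, not the
searcher's own code: the name has_cgi_token() is made up, the match is
case-sensitive, and it assumes that the start or end of the URL counts
as a non-letter boundary.

```python
import re

# "cgi" with no letter immediately before or after it.
CGI_TOKEN = re.compile(r"(?<![A-Za-z])cgi(?![A-Za-z])")

def has_cgi_token(url):
    """Return True if the URL contains "cgi" as a separate token."""
    return CGI_TOKEN.search(url) is not None

print(has_cgi_token("http://example.org/cgi-bin/lookup"))      # token: skip it
print(has_cgi_token("http://example.org/lookup.cgi?tune=1"))   # token: skip it
print(has_cgi_token("http://example.org/abcgigs/tune.abc"))    # letters on both sides
```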

You can learn a lot of weird stuff when you try writing a web  search
program ...
