Christian M. Cepel asks:
| Does John's script obey robot exclusions? I'm ready to kill Altavista
| for spidering my javascript validated forms, submitting them empty, and
| completely ignoring robot exclusions.

Yes, it does. The first thing it does for each site is ask for the robots.txt file, and it stays away from directories that have a general exclusion. The only exceptions are when someone specifically asks to have their music scanned, and then their directory becomes an "exception to the exclusion". I think this has only happened once.

I also have a significant tune collection (partly from extracting tunes from lists like this one). I was given write access to the robots.txt file on the machine a couple of years ago, and it excludes most of my music stuff. I've found that the big search sites just aren't very good for finding music. And then I have to list my own directories as exceptions to the robots.txt rules, as mentioned in the previous paragraph.

OTOH, if I had a collection of abc songs with lyrics, I'd probably want that searched by the big guys. They're all pretty good at finding lyrics.

I know what you mean about the forms. And there's a similar problem with cgi scripts. Maybe two years ago, I started reading about research into searching for "hidden pages" on the web that can only be found via forms and scripts. My reaction to this was "Uh-oh; I'd better watch for this."

About a year ago they hit. Several search sites started invoking my lookup script systematically with random-looking arguments, and when they got a reply with a form, started exploring the links. They were, in effect, attempting to get every abc tune on the web in every format that my scripts know how to return. One of them hit our server simultaneously from about 30 different addresses, and had over 100 tune conversions outstanding. It brought the server to a screeching halt.

I got enough CPU time to add a "blacklist" to my scripts, and whenever I see symptoms of this, I add their address (or subnet) to the blacklist. And I added a small (5 sec) minimum between requests from the same address.

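For anyone who wants to try the same two defenses, here is a rough Python sketch of an address/subnet blacklist combined with a minimum interval between requests from one address. It is not the code from the scripts described above. The names RequestGate, allow, and min_interval are made up for illustration, and keeping the last-seen times in an in-memory dict is an assumption; a real cgi script would need to store them somewhere that survives between invocations.

```python
import ipaddress
import time


class RequestGate:
    """Hypothetical gatekeeper: a blacklist plus a per-address minimum interval."""

    def __init__(self, blacklist, min_interval=5.0, clock=time.monotonic):
        # Each entry may be a single address or a subnet in CIDR notation.
        self.networks = [ipaddress.ip_network(entry, strict=False)
                         for entry in blacklist]
        self.min_interval = min_interval  # seconds between accepted requests
        self.clock = clock                # injectable so it can be tested
        self.last_accepted = {}           # address -> time of last accepted request

    def allow(self, address):
        """Return True if a request from this address should be served."""
        ip = ipaddress.ip_address(address)
        # Blacklisted addresses and subnets are always refused.
        if any(ip in net for net in self.networks):
            return False
        now = self.clock()
        previous = self.last_accepted.get(address)
        # Refuse a request that arrives too soon after the last accepted one.
        if previous is not None and now - previous < self.min_interval:
            return False
        self.last_accepted[address] = now
        return True


if __name__ == "__main__":
    gate = RequestGate(["203.0.113.0/24", "198.51.100.9"])
    for addr in ("203.0.113.40", "192.0.2.1", "192.0.2.1"):
        print(addr, "allowed" if gate.allow(addr) else "refused")
```

The check is keyed on the address that the server sees, so every user behind one firewall shares a single entry.
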
Both of these can be a hassle to people working from behind a firewall, since what my scripts see is the firewall's address, and all users behind it look like a single user. But such things are necessary when there are misbehaving search monsters out there.

One of the side effects of this is that I no longer tell the mailer here to forward my email to my home machine. I log in and read the email here. This means that I'm logged in several times during most days. This is so that I can keep a constant watch for attacks on the web server. Most of these are probably not malicious; they are more likely from novice searchers. But it's a good idea to spot them fast and install defenses against the new ones.

My search program also has a sort of "reverse blacklist". In its list of starting URLs, I can include URLs or hosts that are to be avoided. I've mentioned this on lists that I subscribe to, with the idea that someone might not want their tunes indexed. So far I haven't actually had anyone say they want to be avoided, but it's a possibility. I mostly use this as a way to keep the search program away from some sites that are known sinkholes of time with no abc tunes. There are some sites that have pages with millions of links, and such things are best ignored.

Another thing I have my searcher do is ignore any URL with "cgi" as a token, i.e., with non-letters on both sides. This is fairly effective at preventing the invocation of scripts without arguments, and that's almost always a pure waste of time. I've also been thinking of excluding things like "php", but so far that hasn't been necessary.

You can learn a lot of weird stuff when you try writing a web search program ...

To subscribe/unsubscribe, point your browser to: http://www.tullochgorm.com/lists.html
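
The two searcher-side filters described above (the list of hosts or URLs to avoid, and the "cgi" token rule) can be sketched roughly as follows in Python. This is an illustration and not the search program's actual code. The names should_skip, avoid_hosts, and avoid_prefixes are hypothetical, and so are two choices about the token rule: the match ignores case, and the start or end of the URL counts as a non-letter.

```python
import re
from urllib.parse import urlsplit

# "cgi" as a token: no letter immediately before or after it.
# Assumptions: case-insensitive, and string boundaries count as non-letters.
CGI_TOKEN = re.compile(r"(?<![A-Za-z])cgi(?![A-Za-z])", re.IGNORECASE)


def should_skip(url, avoid_hosts=(), avoid_prefixes=()):
    """Return True if the searcher should not fetch this URL."""
    host = (urlsplit(url).hostname or "").lower()
    # Whole hosts that are to be avoided.
    if host in avoid_hosts:
        return True
    # Individual URLs (or URL prefixes) that are to be avoided.
    if any(url.startswith(prefix) for prefix in avoid_prefixes):
        return True
    # Anything that looks like a script invocation.
    return CGI_TOKEN.search(url) is not None


if __name__ == "__main__":
    for url in ("http://example.org/cgi-bin/lookup",
                "http://example.org/tunes/find.cgi?x=1",
                "http://example.org/cgiguide.html",
                "http://example.org/abc/reel.abc"):
        print(url, "skip" if should_skip(url) else "fetch")
```

Extending the token rule to something like "php" would only mean widening the pattern, e.g. to (?:cgi|php).
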
