On Fri, 10 Mar 2000, Marc Slemko wrote:

> Am I out of luck with most of them because most robots won't spider
> URLs with "?"s in them? Any choice but to convert to using a URL
> format that doesn't include a "?"? That is doable, but is obviously a big
> pain.
I have a little robot that I let loose around here, which did originally spider queries, with a depth limit - I didn't wish to spider the whole web, only a restricted domain/province. I recently decided to drop spidering queries, since one of the sites I was indexing was set up like yours. The problem was that the query-based URL returned a status 200 success code even when the query had failed and returned a "no such entry" message. Since my spider originally entered the site through a list of current entries, it would eventually end up with an index of not only all current entries but all entries that had ever existed, and my disk isn't that big.

This may tie in with the previous discussion on "orphan pages" - I had not thought to try deleting pages whose referers had disappeared; I merely removed pages that no longer gave a status 200 "success" code.

On my Apache server it's trivial to change a CGI URL from foo?id to foo/id - the parameters then appear in PATH_INFO instead of QUERY_STRING. I'd like to see a 404 status when there's no match, though.

On a related note, I have a gripe with vendors whose servers give out status 200 for "page not found". A certain Novell product comes to mind.

Things were a bit simpler a few years back, when almost all pages were static and people put /cgi-bin in robots.txt.

Andrew Daviel
TRIUMF & Vancouver Webpages
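The old static-pages convention I mentioned was just a robots.txt along these lines:

```
User-agent: *
Disallow: /cgi-bin
```

which kept well-behaved robots away from all the dynamic content in one stroke - workable when that was where all the queries lived, less so now that query-style content is everywhere.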
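For what it's worth, the depth-limited, single-domain crawl I described can be sketched roughly like this in Python. The get_links() callable is a stand-in for fetching a page and extracting its links; the real robot obviously does an HTTP fetch there.

```python
from collections import deque
from urllib.parse import urlparse

def crawl(start, get_links, max_depth=2, domain=None):
    """Breadth-first crawl with a depth cap, restricted to one host.

    get_links(url) -> list of absolute URLs found on that page;
    here it is a placeholder for fetch-and-parse.
    """
    domain = domain or urlparse(start).netloc
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # don't follow links below the depth limit
        for link in get_links(url):
            if link in seen or urlparse(link).netloc != domain:
                continue  # already queued, or off our domain/province
            seen.add(link)
            queue.append((link, depth + 1))
    return visited
```

The depth cap is what kept the robot from wandering off into query space indefinitely - until, as noted, the 200-for-everything site defeated it anyway.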
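A minimal sketch of the foo/id arrangement, as a Python CGI script - the entry table here is made up, standing in for whatever database sits behind the site. The point is that the script takes the id from PATH_INFO and emits its own Status: 404 line when the lookup fails, so a spider sees the failure instead of a 200 "no such entry" page:

```python
#!/usr/bin/env python
import os

# Hypothetical entry store; stands in for the real database
# behind foo?id / foo/id.
ENTRIES = {"42": "Entry forty-two"}

def respond(path_info, entries=ENTRIES):
    """Build a CGI response for a /cgi-bin/foo/<id> style URL.

    With the foo/id form, Apache puts the trailing "/42" in
    PATH_INFO rather than QUERY_STRING.
    """
    entry_id = path_info.lstrip("/")
    if entry_id in entries:
        return ("Status: 200 OK\r\nContent-Type: text/plain\r\n\r\n"
                + entries[entry_id])
    # Report the failure as a real 404, not a 200 "no such entry" page.
    return ("Status: 404 Not Found\r\nContent-Type: text/plain\r\n\r\n"
            "no such entry")

if __name__ == "__main__":
    print(respond(os.environ.get("PATH_INFO", "")), end="")
```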
