On Fri, 10 Mar 2000, Marc Slemko wrote:

> Am I out of luck with most of them because most robots won't spider
> URLs with "?"s in them?  Any choice but to convert to using a URL
> format that doesn't include a "?"?  That is doable, but is obviously a big
> pain.

I have a little robot that I let loose around here, which did originally
spider queries, with a depth limit (I didn't wish to spider the web, but
only a restricted domain/province). I recently decided to drop spidering
queries since one of the sites I was indexing was set up like yours. The
problem was, the query-based URL was returning a status 200 success code
even though the query had failed and returned a "no such entry" message.
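A plain status check can't catch that case on its own. A minimal sketch of the spider-side test this pushes you towards, assuming a hypothetical "soft 404" marker phrase (the "no such entry" text the failed queries returned):

```python
def keep_in_index(status: int, body: str) -> bool:
    """Decide whether a fetched page should stay in the index.

    A bare status check is fooled by servers that answer 200 for
    failed queries, so also look for the failure phrase in the body
    -- a "soft 404" heuristic.
    """
    if status != 200:
        return False
    # "no such entry" was the failure text that particular site
    # returned; a real spider would make this list configurable.
    if "no such entry" in body.lower():
        return False
    return True
```

Still fragile, of course - the marker phrase differs per site, which is part of why I gave up on spidering queries at all.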

Since my spider originally entered the site
through a list of current entries, it would eventually end up with an
index of not only all current entries, but all entries that had ever
existed, and my disk isn't that big.

This may tie in with the previous discussion on "orphan pages" - I had
not thought to try deleting pages whose referrers had disappeared, but
merely removed pages that no longer returned a status 200 "success" code.

On my Apache server it's trivial to change a CGI URL from foo?id to
foo/id - the parameters then appear in PATH_INFO instead of QUERY_STRING.
I'd like to see a 404 status when there's no match, though.
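Something along these lines, say - a minimal CGI sketch of the foo/id style, where the entry table and names are hypothetical. Apache puts the "/42" part of /cgi-bin/foo/42 in PATH_INFO, and the script emits a CGI "Status: 404" header on a miss so a spider can tell a failed lookup from a hit:

```python
import os

# Hypothetical datastore standing in for whatever backs the real CGI.
ENTRIES = {"42": "Widget forty-two"}

def respond(path_info: str) -> str:
    """Build a CGI response for foo/id-style URLs.

    The id arrives in PATH_INFO (e.g. "/42").  When it doesn't match
    anything, send a real 404 via the CGI Status header instead of a
    200 with an error page in the body.
    """
    entry = ENTRIES.get(path_info.lstrip("/"))
    if entry is None:
        return ("Status: 404 Not Found\r\n"
                "Content-Type: text/plain\r\n\r\n"
                "no such entry\r\n")
    return "Content-Type: text/plain\r\n\r\n" + entry + "\r\n"

if __name__ == "__main__":
    print(respond(os.environ.get("PATH_INFO", "")), end="")
```

Apache translates the Status header into the actual HTTP status line, so the spider sees a genuine 404.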

On a related note, I have a gripe with vendors whose servers give out
status 200 for "page not found". Some Novell product comes to mind.
Things were a bit simpler a few years back when almost all pages were
static and people put /cgi-bin in robots.txt.

Andrew Daviel
TRIUMF & Vancouver Webpages
