On Tue, 11 Apr 2000, Aaron Swartz wrote:

> some parts. Well, the truth is that those parts include important content
> which it is proper for a search engine to index. Again, I repeat, if it's
> not important, or part of a game or some such, the Robots Exclusion Protocol
> should be used.

The problem as I see it is of filling your index with junk URLs.
For instance, a "for sale" site has a static page with "this week's
bargains", which is highly weighted and is indexed by the robot. Links on
this page consist of database queries which initially return item
descriptions (important content) and a status 200 success code.
Unless the robot has some mechanism to drop orphan pages, the query may
be re-indexed at some later date. It may then return "item not found"
instead of the item description, but still with a status 200 code.
The robot index thus fills up with no-longer-valid queries, which, as far
as the robot is concerned, are error-free pages.
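
A well-behaved query script would signal the dead query with a real error
status instead of a 200 page. A minimal CGI-style sketch in Python (the
lookup() function and the "item" parameter are just placeholders for
whatever database code the site actually uses):

    #!/usr/bin/env python3
    # CGI-style sketch: answer 404 once an item query no longer resolves,
    # rather than a status-200 "item not found" page that robots will keep.
    import os
    from urllib.parse import parse_qs

    def lookup(item_id):
        # Placeholder for the real database lookup; None means the item is gone.
        return None

    query = parse_qs(os.environ.get("QUERY_STRING", ""))
    item = lookup(query.get("item", [""])[0])

    print("Status: 404 Not Found" if item is None else "Status: 200 OK")
    print("Content-Type: text/html")
    print()
    print("<p>Item not found</p>" if item is None else item)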

One cannot rely on website maintainers to do the correct thing -
returning an error code for a failed query, or blocking database searches
with robots.txt ... hmmm, that reminds me - I have some scripts returning
status 200 on error (parsing the path, not the query) that I should fix.
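
For the robots.txt route, something like this would keep well-behaved
robots out of the database queries entirely (the /cgi-bin paths are only
examples - substitute whatever paths the query scripts actually live under):

    # robots.txt - keep robots away from database query scripts
    User-agent: *
    Disallow: /cgi-bin/search
    Disallow: /cgi-bin/query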

A quick check suggests that major search engines are returning status
200 on empty content. Netscape seems to display the page content
on most status codes except 201-204.
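
The check is easy to repeat - e.g. in Python, against any query URL one
suspects of serving "nothing found" pages (the URL below is only an example):

    # Print the status code a query URL actually returns.
    import urllib.request, urllib.error

    url = "http://www.example.com/cgi-bin/search?item=no-such-item"
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req) as resp:
            print(resp.status, resp.reason)  # many sites answer 200 even when empty
    except urllib.error.HTTPError as err:
        print(err.code, err.reason)          # a well-behaved site would answer 404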

Andrew Daviel
TRIUMF & Vancouver Webpages
