I'm adding robots@mccmedia.com to this discussion. That is the classic
list for robots.txt discussion.

Robots list: this is a discussion about the interactions of /robots.txt
and clients or robots that fetch RSS feeds. "Atom" is a new format in
the RSS family.

--On August 26, 2005 8:39:59 PM +1000 Eric Scheid <[EMAIL PROTECTED]> wrote:

> While true that each of these scenarios involve crawling new links,
> the base principle at stake is to prevent harm caused by automatic or
> robotic behaviour. That can include extremely frequent periodic re-fetching,
> a scenario which didn't really exist when robots.txt was first put together.

It was a problem then:

   In 1993 and 1994 there have been occasions where robots have visited WWW
   servers where they weren't welcome for various reasons. Sometimes these
   reasons were robot specific, e.g. certain robots swamped servers with
   rapid-fire requests, or retrieved the same files repeatedly. In other
   situations robots traversed parts of WWW servers that weren't suitable,
   e.g. very deep virtual trees, duplicated information, temporary information,
   or cgi-scripts with side-effects (such as voting).
       <http://www.robotstxt.org/wc/norobots.html>

I see /robots.txt as a declaration by the publisher (webmaster) that
robots are not welcome at those URLs.

Web robots do not solely depend on automatic link discovery, and haven't
for at least ten years. Infoseek had a public "Add URL" page. /robots.txt
was honored regardless of whether the link was manually added or automatically
discovered.

A crawling service (robot) should warn users that the URL, Atom or otherwise,
is disallowed by robots.txt. Report that on the status page for that feed.
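
As a rough sketch of that check, assuming a feed fetcher written in Python
and using the standard library's urllib.robotparser: the function name
feed_status, the user-agent string "ExampleFeedBot", and the example.com
URLs are all made up for illustration, and a real service would fetch the
site's /robots.txt itself rather than take it as a string.

```python
from urllib.robotparser import RobotFileParser


def feed_status(robots_txt, user_agent, feed_url):
    """Return a status line for a feed, given the site's robots.txt text.

    Hypothetical helper: the same check applies whether the feed URL was
    typed in by a user or discovered automatically.
    """
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines, so no
    # network access is needed here.
    parser.parse(robots_txt.splitlines())
    if parser.can_fetch(user_agent, feed_url):
        return "ok: fetching allowed"
    return "blocked: disallowed by robots.txt"


# Illustrative robots.txt; not taken from any real site.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for url in ("http://example.com/feeds/atom.xml",
                "http://example.com/private/atom.xml"):
        print(url, "->", feed_status(EXAMPLE_ROBOTS_TXT, "ExampleFeedBot", url))
```

The returned string is what would go on the status page for that feed, so
the user who added the URL can see why it is not being fetched.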

wunder
--
Walter Underwood
Principal Software Architect, Verity