It was thus said that the Great Walter Underwood once stated:
> 
> --On Sunday, January 11, 2004 8:13 PM -0500 Sean 'Captain Napalm' Conner
> <[EMAIL PROTECTED]> wrote:
> 
> >   And there you go.  Using the different directives makes it backwards
> > compatible with the original robots.txt (where an older robot will ignore
> > the new directives) and without overloading the meaning of existing
> > directives (one of the downsides of my own proposed extension).
> 
> No it does not make it backwards compatible.  It makes it an
> illegal robots.txt file.  Parsers built to ignore unknown directives
> would still be able to use it.  Parsers not built that way would
> not be able to parse the file, and would probably miss all the
> legal directives as well as the non-standard ones.
  That would be a pretty poor parser, and besides, from the spec itself
(http://www.robotstxt.org/wc/norobots.html):

	The file consists of one or more records separated by one or more
	blank lines (terminated by CR, CR/NL, or NL).  Each record contains
	lines of the form "<field>:<optionalspace><value><optionalspace>".
	The field name is case insensitive.

	Comments can be included in file using UNIX bourne shell
	conventions: the '#' character is used to indicate that preceding
	space (if any) and the remainder of the line up to the line
	termination is discarded.  Lines containing only a comment are
	discarded completely, and therefore do not indicate a record
	boundary.

	The record starts with one or more User-agent lines, followed by
	one or more Disallow lines, as detailed below.  Unrecognised
	headers are ignored.

  Right there---last line---"Unrecognised headers are ignored."  Besides,
it's a bit more work to *not* ignore unrecognized directives than it is to
ignore them:

	while(fgets(line,sizeof(line),fprobots) != NULL)
	{
	  if (strncasecmp(line,"user-agent:",11) == 0)
	  {
	    /* we have a user-agent */
	  }
	  else if (strncasecmp(line,"disallow:",9) == 0)
	  {
	    /* we have a disallow */
	  }
	  else
	  {
	    /* we have a comment, or an unrecognized directive---ignore it */
	  }
	}

> I mentioned the internet robustness principle before, but folks
> seem to have missed that. It is:
> 
>    Be conservative in what you send, liberal in what you accept.
> 
> In our case, the contents of the robots.txt file is "sent".
> By the robustness principle, we must not add extra stuff on
> the assumption that the parsers can deal with it.

  By the same token, it is the robots.txt parser that "accepts" the
robots.txt file, so by the robustness principle, it needs to ignore
directives it doesn't understand.

  -spc (And be thankful Martijn didn't decide to use RFC-822 style header
	lines ... )

_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
