It was thus said that the Great Walter Underwood once stated:
> 
> --On Sunday, January 11, 2004 8:13 PM -0500 Sean 'Captain Napalm' Conner
> <[EMAIL PROTECTED]> wrote:
> 
> >   And there you go.  Using the different directives makes it backwards
> > compatible with the original robots.txt (where an older robot will ignore
> > the new directives) and without overloading the meaning of existing
> > directives (one of the downsides of my own proposed extension).
> 
> No it does not make it backwards compatible.  It makes it an
> illegal robots.txt file.  Parsers built to ignore unknown directives
> would still be able to use it.  Parsers not built that way would
> not be able to parse the file, and would probably miss all the
> legal directives as well as the non-standard ones.
  That would be a pretty poor parser, and besides, from the spec itself
(http://www.robotstxt.org/wc/norobots.html):

	The file consists of one or more records separated by one or more
	blank lines (terminated by CR, CR/NL, or NL).  Each record contains
	lines of the form "<field>:<optionalspace><value><optionalspace>".
	The field name is case insensitive.

	Comments can be included in file using UNIX bourne shell
	conventions: the '#' character is used to indicate that preceding
	space (if any) and the remainder of the line up to the line
	termination is discarded.  Lines containing only a comment are
	discarded completely, and therefore do not indicate a record
	boundary.

	The record starts with one or more User-agent lines, followed by
	one or more Disallow lines, as detailed below.  Unrecognised
	headers are ignored.

  Right there---last line---"Unrecognised headers are ignored."  Besides,
it's a bit more work to *not* ignore unrecognized directives than it is to
ignore them:

	while(fgets(line,sizeof(line),fprobots) != NULL)
	{
	  if (strncasecmp(line,"user-agent:",11) == 0)
	  {
	    /* we have a user-agent */
	  }
	  else if (strncasecmp(line,"disallow:",9) == 0)
	  {
	    /* we have a disallow */
	  }
	  else
	  {
	    /* we have a comment, or an unrecognized directive---ignore it */
	  }
	}

> I mentioned the internet robustness principle before, but folks
> seem to have missed that. It is:
> 
>    Be conservative in what you send, liberal in what you accept.
> 
> In our case, the contents of the robots.txt file is "sent".
> By the robustness principle, we must not add extra stuff on
> the assumption that the parsers can deal with it.

  By the same token, it is the robots.txt parser that "accepts" the
robots.txt file, so by the robustness principle, it needs to ignore
directives it doesn't understand.

  -spc (And be thankful Martijn didn't decide to use RFC-822 style header
	lines ... )

_______________________________________________
Robots mailing list
[EMAIL PROTECTED]
http://www.mccmedia.com/mailman/listinfo/robots
