Re: Disallows in robots.txt
So: I was looking at a robots.txt file and it had a series of disallow
instructions for various user agents, and then at the bottom was a full
disallow: [...]

> Wouldn't this just disallow everyone from everything?

No, it would disallow everyone but a ... d (with the specified
restrictions). From the spec:

    The robot must obey the first record in /robots.txt that contains a
    User-Agent line whose value contains the name token of the robot as
    a substring. The name comparisons are case-insensitive. If no such
    record exists, it should obey the first record with a User-agent
    line with a "*" value, if present. If no record satisfied either
    condition, or no records are present at all, access is unlimited.

Regards,
Martin
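The record-selection rule quoted above can be sketched in Python. This is a minimal illustration only, not code from the spec or the list; the record layout (a list of `(user_agents, disallows)` pairs) and all names are my own.

```python
def select_record(records, robot_name):
    """Pick the robots.txt record a robot should obey, per the draft spec:
    first record whose User-agent value contains the robot's name token as
    a case-insensitive substring; failing that, the first '*' record;
    failing that, None (access is then unlimited)."""
    name = robot_name.lower()
    for agents, disallows in records:
        if any(name in agent.lower() for agent in agents):
            return disallows
    for agents, disallows in records:
        if "*" in agents:
            return disallows
    return None  # no record applies: access is unlimited


# Hypothetical file: one record for "FooBot", then a catch-all.
records = [
    (["FooBot"], ["/private"]),
    (["*"], ["/"]),
]

print(select_record(records, "foobot"))   # FooBot's own record wins
print(select_record(records, "BarBot"))   # falls through to the '*' record
```

Note that because the comparison is a substring test, a record for `User-agent: Foo` would also match a robot named `FooBot`.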
Re: Stemming and Wildcards in robots.txt files
Jonathan Knoll:
> User-agent: *
> Disallow: /cgi-bin
> Disallow: /site

Klaus Johannes Rusch:
> /cgi-bin/test.cgi
> /siteindex.html
> would be excluded.

> But what about these paths (in the same root dir):
> /foo/cgi-bin/test.cgi
> /bar/user1/cgi-bin/test.sgi
> /bar/user2/cgi-bin/test.cgi
> Does the wildcard function recognize specified strings elsewhere
> (later) than in the immediate beginning of a path?

The draft specification is quite clear on this: the strings are compared
octet by octet from the beginning of the path, either until the Allow /
Disallow string ends, in which case the rule matches, or until a mismatch
is found. So none of the three paths above would be excluded, since
neither "/cgi-bin" nor "/site" is a prefix of any of them. From the spec:

    The matching process compares every octet in the path portion of the
    URL and the path from the record. [...] The match evaluates
    positively if and only if the end of the path from the record is
    reached before a difference in octets is encountered.

Regards,
Martin
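The prefix comparison described above amounts to a byte-prefix test. A minimal sketch, assuming UTF-8 paths (the function name and examples are my own illustration, not from the spec):

```python
def rule_matches(rule_path: str, url_path: str) -> bool:
    """Octet-by-octet match per the draft spec: compare from the first
    octet; the match succeeds iff the end of the rule's path is reached
    before a differing octet is encountered, i.e. the rule path is a
    byte prefix of the URL path."""
    return url_path.encode("utf-8").startswith(rule_path.encode("utf-8"))


# Rules only match at the very start of the path:
print(rule_matches("/cgi-bin", "/cgi-bin/test.cgi"))       # matches
print(rule_matches("/site", "/siteindex.html"))            # matches
print(rule_matches("/cgi-bin", "/foo/cgi-bin/test.cgi"))   # does not match
```

This is why `/siteindex.html` is excluded by `Disallow: /site` even though it is not inside a `/site/` directory, while `/foo/cgi-bin/test.cgi` is not excluded by `Disallow: /cgi-bin`.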