[ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965439#action_12965439 ]
Fuad Efendi edited comment on DROIDS-109 at 11/30/10 4:12 PM:
--------------------------------------------------------------

http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):

{quote}The matching process compares every octet in the path portion of the URL and the path from the record. If a %xx encoded octet is encountered it is unencoded prior to comparison, unless it is the "/" character, which has special meaning in a path. The match evaluates positively if and only if the end of the path from the record is reached before a difference in octets is encountered.{quote}

Koster writes nothing about the encoding/decoding of robots.txt itself (HTTP response headers); he mentions only HTTP cache control, in section 3.4.

Logically, we need to decode the path (excluding %2F) before comparing it to a rule, and the decoded path may contain any Unicode character. It follows that webmasters are allowed to use any charset in robots.txt, so we must analyze the HTTP headers and decode the stream accordingly, although this is not officially specified anywhere yet (except http://nikitathespider.com/python/rerp/).

Also, don't forget that "path" in this unofficial document (1996) really means everything after "protocol + // + host + port", for instance: /query;sessionID=123#My%2fAnchor?abc=123
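The octet-comparison rule quoted above can be sketched roughly as follows. This is a minimal illustration, not the Droids implementation; the class and method names are made up, well-formed %xx escapes and an otherwise ASCII path are assumed, and decoded octets are interpreted as UTF-8:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class RobotsPathMatcher {

    // Decode %xx escapes in a path, except %2F ("/"), which has special
    // meaning in a path and must stay encoded (draft-koster-robots-00, p. 5).
    // Assumes well-formed escapes; unescaped characters are taken as ASCII.
    static String decodeExceptSlash(String path) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()) {
                String hex = path.substring(i + 1, i + 3);
                int octet = Integer.parseInt(hex, 16); // handles %3c and %3C alike
                if (octet == '/') {
                    // keep %2F encoded: it is not the same as a path separator
                    out.write('%');
                    out.write(hex.charAt(0));
                    out.write(hex.charAt(1));
                } else {
                    out.write(octet);
                }
                i += 2;
            } else {
                out.write(c);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // A record path matches if it is an octet-wise prefix of the URL path,
    // both decoded as above.
    static boolean matches(String recordPath, String urlPath) {
        return decodeExceptSlash(urlPath).startsWith(decodeExceptSlash(recordPath));
    }

    public static void main(String[] args) {
        System.out.println(matches("/a%3cd.html", "/a%3Cd.html")); // true: both decode to /a<d.html
        System.out.println(matches("/a%2Fb", "/a/b"));             // false: %2F stays encoded
    }
}
```

Note how this differs from the current NoRobotClient behavior: decoding happens exactly once, and %2F is never collapsed into "/".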
> Several defects in robots exclusion protocol (robots.txt) implementation
> ------------------------------------------------------------------------
>
>                 Key: DROIDS-109
>                 URL: https://issues.apache.org/jira/browse/DROIDS-109
>             Project: Droids
>          Issue Type: Bug
>          Components: core, norobots
>            Reporter: Fuad Efendi
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> 1. Googlebot and many others support rules on the query part; Droids currently supports only URI.getPath() (without the query part).
> 2. %2F represents the "/" (slash) character inside a path; it should not be decoded before applying a rule.
> 3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: in the method body, baseURI.getPath() already returns a decoded string, and then we call URLDecoder.decode(path, US_ASCII) on it again.
> 4. URLDecoder.decode(path, US_ASCII) - UTF-8 must be used!
> 5. The longest matching directive path (not counting wildcard expansion) should be the one applied to any page URL.
> 6. Wildcard characters should be recognized.
> 7. Sitemaps.
> 8. Crawl rate.
> 9. The BOM sequence (bytes 0xEF 0xBB 0xBF, see http://unicode.org/faq/utf_bom.html) is not removed before processing robots.txt.
> ...and most probably many more defects (Nutch & BIXO haven't done it in full yet). I am working on it right now...
> Some references:
> http://nikitathespider.com/python/rerp/
> http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
> http://www.searchtools.com/robots/robots-txt.html
> http://en.wikipedia.org/wiki/Robots.txt
> http://www.robotstxt.org/wc/norobots-rfc.html, referenced even by Google, seems at least outdated...
> Proper reference: http://www.robotstxt.org/norobots-rfc.txt (1996).
> We need a WIKI page explaining all rules implemented by Droids; hopefully it will become an unofficial standard.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
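Defect 5 in the list above (longest-match precedence) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Droids API: the Directive record, the allow flag, and isAllowed are hypothetical names, paths are assumed already decoded (per defect 3), and wildcard expansion (defect 6) is deliberately left out:

```java
import java.util.List;

public class LongestMatch {

    // A single Allow/Disallow record path; allow == false means Disallow.
    record Directive(String path, boolean allow) {}

    // Among all directives whose path prefixes the URL path, apply the one
    // with the longest path. No matching directive means crawling is allowed.
    static boolean isAllowed(List<Directive> directives, String decodedPath) {
        Directive best = null;
        for (Directive d : directives) {
            if (decodedPath.startsWith(d.path)
                    && (best == null || d.path.length() > best.path.length())) {
                best = d;
            }
        }
        return best == null || best.allow();
    }

    public static void main(String[] args) {
        List<Directive> rules = List.of(
                new Directive("/folder", false),       // Disallow: /folder
                new Directive("/folder/page", true));  // Allow: /folder/page
        System.out.println(isAllowed(rules, "/folder/page.html"));  // true
        System.out.println(isAllowed(rules, "/folder/other.html")); // false
    }
}
```

Under this rule the more specific Allow wins over the shorter Disallow, which matches Googlebot's documented behavior and is independent of the order of records in robots.txt.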