[ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Fuad Efendi updated DROIDS-109:
-------------------------------

Description:

1. Googlebot and many others support rules on the query part of a URL; Droids currently applies rules only to URI.getPath() (without the query part).
2. %2F represents the "/" (slash) character inside a path; it should not be decoded before a rule is applied.
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: in the method body, baseURI.getPath() already returns a decoded string, and then URLDecoder.decode(path, US_ASCII) is called on it again.
4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
5. The longest matching directive path (not including wildcard expansion) should be the one applied to any page URL.
6. Wildcard characters should be recognized.
7. Sitemaps are not supported.
8. Crawl rate is not supported.
9. The BOM sequence (bytes 0xEF 0xBB 0xBF; see http://unicode.org/faq/utf_bom.html) is not removed before robots.txt is processed.

...and most probably many more defects (Nutch & BIXO haven't done this in full yet). I am working on it right now...

Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt

http://www.robotstxt.org/wc/norobots-rfc.html, although referenced even by Google, seems at least outdated; the proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).

We need a wiki page explaining all the rules implemented by Droids; hopefully it will become an unofficial standard.

Recent update from Google: *http://code.google.com/web/controlcrawlindex/*

> Several defects in robots exclusion protocol (robots.txt) implementation
> ------------------------------------------------------------------------
>
>                 Key: DROIDS-109
>                 URL: https://issues.apache.org/jira/browse/DROIDS-109
>             Project: Droids
>          Issue Type: Bug
>          Components: core, norobots
>    Affects Versions: Graduating from the Incubator
>            Reporter: Fuad Efendi
>             Fix For: Graduating from the Incubator
>
>   Original Estimate: 672h
>  Remaining Estimate: 672h

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
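The decoding defects (items 2-4) can be reproduced with a minimal sketch using plain JDK calls (this is a generic example, not Droids code): URI.getPath() already percent-decodes once, so matching against it loses the difference between "/" and an encoded "%2F" and drops the query part, and any further decode with US_ASCII corrupts multi-byte UTF-8 escapes.

```java
import java.net.URI;
import java.net.URLDecoder;

public class DecodingDefects {
    public static void main(String[] args) throws Exception {
        URI uri = URI.create("http://example.com/a%2Fb?q=1");
        // getPath() percent-decodes once, so the encoded slash is lost:
        System.out.println(uri.getPath());     // "/a/b"
        // getRawPath() preserves "%2F", so rules can be applied correctly:
        System.out.println(uri.getRawPath());  // "/a%2Fb"
        // getPath() also excludes the query part entirely (defect 1):
        System.out.println(uri.getQuery());    // "q=1"
        // Decoding with US-ASCII corrupts multi-byte UTF-8 sequences:
        System.out.println(URLDecoder.decode("%C3%A9", "US-ASCII")); // not "é"
        System.out.println(URLDecoder.decode("%C3%A9", "UTF-8"));    // "é"
    }
}
```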
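Longest-match selection (item 5) combined with wildcard support (item 6) might look like the sketch below; the class and method names are hypothetical, not part of the Droids API, and tie-breaking between equal-length Allow/Disallow rules is deliberately simplified.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch, not the Droids API: apply the matching directive whose
// literal part (wildcards excluded) is longest, supporting '*' and '$'.
public class RobotsMatcher {

    /** Does a robots.txt path pattern (with '*' and trailing '$') match a path? */
    static boolean matches(String directive, String path) {
        boolean anchored = directive.endsWith("$");
        String pat = anchored ? directive.substring(0, directive.length() - 1)
                              : directive;
        StringBuilder re = new StringBuilder();
        for (String part : pat.split("\\*", -1)) {
            if (re.length() > 0) re.append(".*");   // '*' matches any sequence
            re.append(Pattern.quote(part));
        }
        if (!anchored) re.append(".*");             // prefix match by default
        return path.matches(re.toString());
    }

    /** True if path is allowed; rules map directive -> allow flag. */
    static boolean isAllowed(Map<String, Boolean> rules, String path) {
        int bestLen = -1;
        boolean allowed = true;                     // no matching rule: allowed
        for (Map.Entry<String, Boolean> e : rules.entrySet()) {
            // Rule length is measured without wildcard expansion (item 5).
            int len = e.getKey().replace("*", "").replace("$", "").length();
            if (len > bestLen && matches(e.getKey(), path)) {
                bestLen = len;
                allowed = e.getValue();
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        Map<String, Boolean> rules = new LinkedHashMap<>();
        rules.put("/p", true);                      // Allow: /p
        rules.put("/private/", false);              // Disallow: /private/
        rules.put("/*.php$", false);                // Disallow: /*.php$
        System.out.println(isAllowed(rules, "/page"));          // true
        System.out.println(isAllowed(rules, "/private/page"));  // false
        System.out.println(isAllowed(rules, "/index.php"));     // false
    }
}
```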
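For item 9, a leading UTF-8 BOM (0xEF 0xBB 0xBF) can be stripped before the parser sees the stream; the wrapper below is a hypothetical example, not existing Droids code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: drop a UTF-8 BOM from the head of a robots.txt stream.
public class BomStripper {
    static InputStream stripUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.read(head, 0, 3);
        boolean bom = n == 3 && (head[0] & 0xFF) == 0xEF
                             && (head[1] & 0xFF) == 0xBB
                             && (head[2] & 0xFF) == 0xBF;
        // Not a BOM: push the bytes back so the parser still sees them.
        if (!bom && n > 0) pin.unread(head, 0, n);
        return pin;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
                          'U', 's', 'e', 'r'};
        InputStream in = stripUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8)); // User
    }
}
```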