[ 
https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fuad Efendi updated DROIDS-109:
-------------------------------

    Description: 
The referenced document (cited even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html is 
at least 12 years out of date.

1. Googlebot and many others support query-part rules; Droids currently 
supports only URI.getPath() (without the query part)
2. %2F represents the "/" (slash) character inside a path; it should not be 
decoded before a rule is applied
3. NoRobotClient.isUrlAllowed(URI uri) decodes twice: baseURI.getPath() 
already returns a decoded string, and then URLDecoder.decode(path, US_ASCII) 
is called on it again
4. URLDecoder.decode(path, US_ASCII) uses US-ASCII; UTF-8 must be used 
instead! (Items 1-4 are illustrated by the first sketch after this list.)
5. The longest matching directive path (not including wildcard expansion) 
should be the one applied to any page URL
6. Wildcard characters should be recognized (see the second sketch below)
7. Sitemap directives are not supported
8. Crawl rate (e.g. Crawl-delay) is not supported
9. BOM sequence is not removed before processing robots.txt 
(http://unicode.org/faq/utf_bom.html; bytes 0xEF 0xBB 0xBF); see the third 
sketch below
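
To illustrate items 1-4, here is a minimal sketch (a hypothetical helper, not 
the actual Droids code; the class and method names are mine) of building the 
string that rules should be matched against: raw path plus query, decoded 
nowhere, and certainly not with US-ASCII:

import java.net.URI;

// Hypothetical helper (name is mine, not part of Droids): builds the string
// that robots.txt rules should be matched against. It uses the *raw*
// (still percent-encoded) path and query, so %2F stays "%2F" (item 2),
// nothing is decoded twice (item 3), and if decoding were ever needed it
// would be done once, with UTF-8, never with US-ASCII (item 4).
public final class RobotsPaths {

    public static String pathAndQueryForMatching(URI uri) {
        String path = uri.getRawPath();       // raw, percent-encoded path
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();     // raw query part, may be null
        return (query == null) ? path : path + "?" + query;   // item 1
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://example.com/a%2Fb/page?id=42&lang=en");
        // Prints "/a%2Fb/page?id=42&lang=en": %2F preserved, query included.
        System.out.println(pathAndQueryForMatching(uri));
    }
}

Using getRawPath()/getRawQuery() keeps the percent-encoding intact; if any 
decoding is done at all, it must happen exactly once and with UTF-8.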

and most probably many more defects (Nutch & BIXO haven't implemented it in 
full yet). I am working on it right now... 
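
To make items 5 and 6 concrete, here is a second sketch (again hypothetical, 
not the Droids implementation) of wildcard-aware matching where the directive 
with the longest literal path, wildcards not counted, wins:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch (not Droids code): "longest match wins" with '*' and
// '$' wildcards. Directive paths are taken as written in robots.txt; the
// URL path is the raw percent-encoded path (plus query) of the page.
public final class RobotsMatcher {

    static final class Rule {
        final String path;     // directive value as written in robots.txt
        final boolean allow;   // Allow vs Disallow
        Rule(String path, boolean allow) { this.path = path; this.allow = allow; }
    }

    // Translate a directive into a regex: '*' matches any sequence, a
    // trailing '$' anchors the end, everything else is literal.
    static Pattern toPattern(String directive) {
        boolean anchored = directive.endsWith("$");
        String body = anchored ? directive.substring(0, directive.length() - 1) : directive;
        StringBuilder re = new StringBuilder("^");
        for (String literal : body.split("\\*", -1)) {
            re.append(Pattern.quote(literal)).append(".*");
        }
        re.setLength(re.length() - 2);         // drop the trailing ".*"
        re.append(anchored ? "$" : ".*");
        return Pattern.compile(re.toString());
    }

    // Length used for "longest match": literal characters only (item 5 says
    // wildcard expansion is not counted).
    static int specificity(String directive) {
        return directive.replace("*", "").replace("$", "").length();
    }

    public static boolean isAllowed(String urlPath, List<Rule> rules) {
        Rule best = null;
        for (Rule r : rules) {
            if (toPattern(r.path).matcher(urlPath).matches()
                    && (best == null || specificity(r.path) > specificity(best.path))) {
                best = r;
            }
        }
        return best == null || best.allow;     // no matching rule => allowed
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<Rule>();
        rules.add(new Rule("/shop/", false));          // Disallow: /shop/
        rules.add(new Rule("/shop/*?lang=en", true));  // Allow: /shop/*?lang=en
        // The Allow rule has the longer literal path, so it wins here:
        System.out.println(isAllowed("/shop/item?lang=en", rules)); // true
        System.out.println(isAllowed("/shop/item?lang=de", rules)); // false
    }
}

Note how the query-part rule from item 1 only works if the query string is 
kept on the URL path being matched, as in the first sketch.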


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html
http://en.wikipedia.org/wiki/Robots.txt
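
And for item 9, a third sketch (hypothetical helper, not Droids code) that 
strips the UTF-8 BOM before the downloaded robots.txt bytes are decoded and 
parsed:

import java.nio.charset.Charset;

// Hypothetical helper (not Droids code): drop a leading UTF-8 BOM
// (bytes 0xEF 0xBB 0xBF) from the downloaded robots.txt bytes before
// decoding and parsing; otherwise the BOM sticks to the first line
// (e.g. before "User-agent") and a naive parser may not recognize it.
public final class BomStripper {

    public static String decodeRobotsTxt(byte[] body) {
        int offset = 0;
        if (body.length >= 3
                && (body[0] & 0xFF) == 0xEF
                && (body[1] & 0xFF) == 0xBB
                && (body[2] & 0xFF) == 0xBF) {
            offset = 3;                        // skip the BOM
        }
        // UTF-8, not US-ASCII (see item 4 above).
        return new String(body, offset, body.length - offset, Charset.forName("UTF-8"));
    }
}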



  was:
Referenced (even by Google!) http://www.robotstxt.org/wc/norobots-rfc.html is 
at least 12 years outdated.

1. Googlebot and many others support query part rules; Droids currently 
supports only URI.getPath() (without query part)
2. %2F represents "/" (slash) character inside a path; it shouldn't be decoded 
before applying rule
3. Double decoding is used by NoRobotClient.isUrlAllowed(URI uri) (method body; 
baseURI.getPath(); returns decoded string; then we call another 
URLDecoder.decode(path, US_ASCII);
4. URLDecoder.decode(path, US_ASCII); - UTF-8 must be used!
5. The longest matching directive path (not including wildcard expansion) 
should be the one applied to any page URL
6. Wildcard characters should be recognized
7. Sitemaps
8. Crawl rate
9. BOM sequence is not removed before processing robots.txt 
(http://unicode.org/faq/utf_bom.html, bytes: 0xEF 0xBB 0xBF

and most probably many more defects (Nutch & BIXO haven't done it in-full yet). 
I am working on it right now... 


Some references:
http://nikitathespider.com/python/rerp/
http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
http://www.searchtools.com/robots/robots-txt.html





We also need to deal with HTTP response headers: for instance, decoding 
robots.txt into the proper charset, handling the expiration header, etc.
I would need to modify the ContentLoader interface, then the implementations, 
and subsequently break the whole framework :) let's think...

> Several defects in robots exclusion protocol (robots.txt) implementation
> ------------------------------------------------------------------------
>
>                 Key: DROIDS-109
>                 URL: https://issues.apache.org/jira/browse/DROIDS-109
>             Project: Droids
>          Issue Type: Bug
>          Components: core, norobots
>            Reporter: Fuad Efendi
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
