[ https://issues.apache.org/jira/browse/DROIDS-109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12965439#action_12965439 ]

Fuad Efendi edited comment on DROIDS-109 at 11/30/10 4:12 PM:
--------------------------------------------------------------

http://www.robotstxt.org/norobots-rfc.txt (draft-koster-robots-00.txt, page 5):
{quote}The matching process compares every octet in the path portion of
   the URL and the path from the record. If a %xx encoded octet is
   encountered it is unencoded prior to comparison, unless it is the
   "/" character, which has special meaning in a path. The match
   evaluates positively if and only if the end of the path from the
   record is reached before a difference in octets is encountered.{quote}
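
In other words: after decoding, the record path must be an octet-wise prefix of
the URL path. A minimal sketch of that test (class and method names are mine,
not the Droids API), assuming both sides were already decoded as discussed below:

{code:java}
import java.nio.charset.StandardCharsets;

final class RuleMatcher {

    /** The octet-wise prefix test from the draft: the match evaluates
     *  positively iff the end of the record path is reached before a
     *  difference in octets is encountered. */
    static boolean matches(String recordPath, String urlPath) {
        byte[] record = recordPath.getBytes(StandardCharsets.UTF_8);
        byte[] url = urlPath.getBytes(StandardCharsets.UTF_8);
        if (record.length > url.length) {
            return false; // the URL path ended before the record path did
        }
        for (int i = 0; i < record.length; i++) {
            if (record[i] != url[i]) {
                return false; // a differing octet before the end of the record
            }
        }
        return true;
    }
}
{code}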

Koster doesn't write anything about the encoding/decoding of robots.txt itself 
(HTTP response headers); the only HTTP-level detail he mentions is cache 
control, in section 3.4...

Logically, we need to percent-decode the path (leaving %2F encoded) before 
comparing it to a rule; and the decoded path may contain any Unicode character.
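
A minimal sketch of such a decoder (PathDecoder / decodeExceptSlash are
hypothetical names; malformed %-escapes are left unhandled): it collects the
decoded octets and only then interprets them as UTF-8, so multi-byte sequences
survive intact:

{code:java}
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PathDecoder {

    /** Percent-decodes a path but leaves %2F (the encoded "/") as-is. */
    static String decodeExceptSlash(String path) {
        ByteArrayOutputStream octets = new ByteArrayOutputStream();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '%' && i + 2 < path.length()) {
                int octet = Integer.parseInt(path.substring(i + 1, i + 3), 16);
                if (octet == 0x2F) {
                    // "/" has special meaning in a path: keep it encoded
                    octets.write('%');
                    octets.write(path.charAt(i + 1));
                    octets.write(path.charAt(i + 2));
                } else {
                    octets.write(octet); // may be one byte of a multi-byte UTF-8 char
                }
                i += 2;
            } else {
                octets.write(c); // raw URL paths are ASCII, one char = one octet
            }
        }
        // Interpret the octets as UTF-8, not US-ASCII (see defect #4 below)
        return new String(octets.toByteArray(), StandardCharsets.UTF_8);
    }
}
{code}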

It naturally means that webmasters are allowed to use any charset in 
robots.txt, and that we must analyze the HTTP headers and decode the stream 
accordingly, although this isn't officially documented anywhere yet (except at 
http://nikitathespider.com/python/rerp/)
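
In practice, "analyze HTTP headers" would mean honoring the charset parameter
of the Content-Type response header when reading the robots.txt body. A sketch
under that assumption (RobotsCharset is a hypothetical name, and the UTF-8
fallback is my choice, not something any spec mandates):

{code:java}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class RobotsCharset {

    /** Picks the charset for decoding a robots.txt body from its Content-Type
     *  header, e.g. "text/plain; charset=ISO-8859-1". Falls back to UTF-8. */
    static Charset fromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    try {
                        return Charset.forName(p.substring(8).trim().replace("\"", ""));
                    } catch (RuntimeException e) {
                        break; // unknown or illegal charset name: fall back
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
{code}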

Also, don't forget that "path" in this unofficial document (1996) really means 
whatever comes after "protocol + // + host + port"... for instance:
/query;sessionID=123#My%2fAnchor?abc=123
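
With java.net.URI that wider "path" can be reassembled from the raw (still
encoded) components; path parameters such as ;sessionID=... stay inside
getRawPath(). A sketch (FullPath is a hypothetical name; note that user agents
never send the fragment to the server, so for real matching only path and
query matter):

{code:java}
import java.net.URI;

final class FullPath {

    /** Everything after "protocol + // + host + port", still %-encoded. */
    static String of(URI uri) {
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getRawPath() == null ? "" : uri.getRawPath());
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }
}
{code}

For example, FullPath.of(URI.create("http://example.com/query;sessionID=123?abc=123#My%2fAnchor"))
returns "/query;sessionID=123?abc=123#My%2fAnchor".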

> Several defects in robots exclusion protocol (robots.txt) implementation
> ------------------------------------------------------------------------
>
>                 Key: DROIDS-109
>                 URL: https://issues.apache.org/jira/browse/DROIDS-109
>             Project: Droids
>          Issue Type: Bug
>          Components: core, norobots
>            Reporter: Fuad Efendi
>   Original Estimate: 672h
>  Remaining Estimate: 672h
>
> 1. Googlebot and many others support query part rules; Droids currently 
> supports only URI.getPath() (without query part)
> 2. %2F represents the "/" (slash) character inside a path; it shouldn't be 
> decoded before applying a rule
> 3. Double decoding is performed by NoRobotClient.isUrlAllowed(URI uri): in 
> the method body, baseURI.getPath() already returns a decoded string, and then 
> URLDecoder.decode(path, US_ASCII) is called on it again
> 4. URLDecoder.decode(path, US_ASCII) uses the wrong charset - UTF-8 must be used!
> 5. The longest matching directive path (not including wildcard expansion) 
> should be the one applied to any page URL
> 6. Wildcard characters should be recognized
> 7. Sitemaps
> 8. Crawl rate
> 9. The BOM sequence is not removed before processing robots.txt 
> (http://unicode.org/faq/utf_bom.html; bytes 0xEF 0xBB 0xBF)
> and most probably there are many more defects (Nutch & BIXO haven't done 
> this in full yet). I am working on it right now... 
> Some references:
> http://nikitathespider.com/python/rerp/
> http://en.wikipedia.org/wiki/Uniform_Resource_Identifier
> http://www.searchtools.com/robots/robots-txt.html
> http://en.wikipedia.org/wiki/Robots.txt
> The widely referenced (even by Google!) 
> http://www.robotstxt.org/wc/norobots-rfc.html seems outdated at best...
> The proper reference is http://www.robotstxt.org/norobots-rfc.txt (1996).
> We need a wiki page explaining all the rules implemented by Droids; hopefully 
> it will become an unofficial standard.
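
Regarding defect #5 above, a minimal sketch of longest-match selection
(Directive and isAllowed are hypothetical names; wildcard expansion is
deliberately left out, and a URL that no directive matches is allowed by
default):

{code:java}
import java.util.List;

final class DirectiveSelector {

    /** Hypothetical rule holder: a directive path plus its allow/disallow verdict. */
    static final class Directive {
        final String path;
        final boolean allow;
        Directive(String path, boolean allow) { this.path = path; this.allow = allow; }
    }

    /** Applies the longest matching directive path to an already-decoded URL path. */
    static boolean isAllowed(List<Directive> directives, String decodedPath) {
        Directive best = null;
        for (Directive d : directives) {
            if (decodedPath.startsWith(d.path)
                    && (best == null || d.path.length() > best.path.length())) {
                best = d;
            }
        }
        return best == null || best.allow;
    }
}
{code}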
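
And regarding defect #9, a sketch of stripping the UTF-8 BOM (bytes 0xEF 0xBB
0xBF) before the robots.txt parser ever sees the stream:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

final class BomStripper {

    /** Returns a stream positioned past a leading UTF-8 BOM if one is present;
     *  otherwise returns the stream with nothing consumed. */
    static InputStream stripUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) { // read() may deliver fewer bytes than requested
            int r = pb.read(head, n, 3 - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        boolean bom = n == 3
                && head[0] == (byte) 0xEF
                && head[1] == (byte) 0xBB
                && head[2] == (byte) 0xBF;
        if (!bom && n > 0) {
            pb.unread(head, 0, n); // no BOM: push everything back unread
        }
        return pb;
    }
}
{code}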
