Sebastian Nagel created NUTCH-1752:
--------------------------------------

             Summary: cache robots.txt rules per protocol:host:port
                 Key: NUTCH-1752
                 URL: https://issues.apache.org/jira/browse/NUTCH-1752
             Project: Nutch
          Issue Type: Bug
          Components: protocol
    Affects Versions: 2.2.1, 1.8
            Reporter: Sebastian Nagel
             Fix For: 2.3, 1.9


HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host" 
(before NUTCH-1031 caching was per "host" only). The caching should be per 
"protocol:host:port": when in doubt, a request to a different port may deliver 
a different {{robots.txt}}. 
Applying robots.txt rules per combination of protocol, host, and port is 
common practice: the 
[Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does not 
state this explicitly (though it can be derived from its examples), but other 
sources do:
* [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: "each protocol and port 
needs its own robots.txt file"
* [Google 
webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]:
 "The directives listed in the robots.txt file apply only to the host, protocol 
and port number where the file is hosted."
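The fix amounts to including the port in the cache key. A minimal sketch of the key construction (class and method names here are illustrative, not the actual Nutch code); URLs without an explicit port fall back to the protocol's default so that, e.g., "http://example.com/" and "http://example.com:80/" share one cache entry:

```java
import java.net.URL;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache robots.txt rules per "protocol:host:port"
// instead of per "protocol:host". Rules are simplified to String here.
public class RobotsCacheKey {

    private final ConcurrentHashMap<String, String> cache =
        new ConcurrentHashMap<>();

    /** Build the cache key "protocol:host:port" for a URL. */
    public static String getCacheKey(URL url) {
        String protocol = url.getProtocol().toLowerCase();
        String host = url.getHost().toLowerCase();
        int port = url.getPort();
        if (port == -1) {
            // No explicit port in the URL: use the protocol's default
            // (80 for http, 443 for https).
            port = url.getDefaultPort();
        }
        return protocol + ":" + host + ":" + port;
    }

    /** Look up cached rules, or fetch-and-parse on a miss (stubbed). */
    public String getRobotRules(URL url) {
        return cache.computeIfAbsent(getCacheKey(url),
            key -> "rules for " + key); // stub for fetch + parse
    }
}
```

With this key, "http://example.com:8080/" maps to "http:example.com:8080" and no longer shares cached rules with "http://example.com/" ("http:example.com:80").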




--
This message was sent by Atlassian JIRA
(v6.2#6252)