[
https://issues.apache.org/jira/browse/NUTCH-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-1752:
-----------------------------------
Attachment: NUTCH-1752-v2.patch
Attached reviewed patch v2. It changes and fixes the caching of robot rules
for a redirected robots.txt:
* patch v1 introduced a bug: the cache key for the redirect target was not
constructed properly
* the cache key for the redirect target now uses the protocol and port of the
redirect target. E.g., if https://host1/robots.txt redirects to
http://host2/robots.txt, the rules from the latter are cached for both
"https:host1:443" and "http:host2:80".
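
The caching behaviour described in the list above could be sketched roughly as
follows. This is an illustration only, not code from the patch: the class name
RobotsCacheKeySketch, the methods cacheKey and cacheRedirected, and the use of
a plain String in place of parsed robot rules are all hypothetical. It assumes
the key has the form "protocol:host:port" and that a missing port falls back
to the protocol's default port.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch; not the actual HttpRobotRulesParser code.
class RobotsCacheKeySketch {

    // Stand-in cache: a String placeholder is stored instead of parsed rules.
    static final Map<String, String> CACHE = new HashMap<>();

    /** Builds "protocol:host:port", using the protocol's default port if none is given. */
    static String cacheKey(String urlString) {
        URL url;
        try {
            url = URI.create(urlString).toURL();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + urlString, e);
        }
        String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
        String host = url.getHost().toLowerCase(Locale.ROOT);
        int port = url.getPort();
        if (port == -1) {
            // No explicit port in the URL: 80 for http, 443 for https.
            port = url.getDefaultPort();
        }
        return protocol + ":" + host + ":" + port;
    }

    /** Caches the rules under the key of the requested URL and of the redirect target. */
    static void cacheRedirected(String requested, String redirectTarget, String rules) {
        CACHE.put(cacheKey(requested), rules);
        CACHE.put(cacheKey(redirectTarget), rules);
    }

    public static void main(String[] args) {
        cacheRedirected("https://host1/robots.txt",
                        "http://host2/robots.txt",
                        "rules fetched from host2");
        // Both keys now map to the same rules.
        System.out.println(CACHE.get("https:host1:443"));
        System.out.println(CACHE.get("http:host2:80"));
    }
}
```

Because each key is derived from its own URL, a redirect that changes protocol
or port cannot leak the source's protocol or port into the target's key.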
> cache robots.txt rules per protocol:host:port
> ---------------------------------------------
>
> Key: NUTCH-1752
> URL: https://issues.apache.org/jira/browse/NUTCH-1752
> Project: Nutch
> Issue Type: Bug
> Components: protocol
> Affects Versions: 1.8, 2.2.1
> Reporter: Sebastian Nagel
> Fix For: 2.3, 1.9
>
> Attachments: NUTCH-1752-v1.patch, NUTCH-1752-v2.patch
>
>
> HttpRobotRulesParser caches rules from {{robots.txt}} per "protocol:host"
> (before NUTCH-1031 caching was per "host" only). The caching should be per
> "protocol:host:port", because a request to a different port may deliver a
> different {{robots.txt}}.
> Applying robots.txt rules to a combination of host, protocol, and port is
> common practice:
> the [Norobots RFC 1996 draft|http://www.robotstxt.org/norobots-rfc.txt] does
> not state this explicitly (it can be derived from its examples), but others do:
> * [Wikipedia|http://en.wikipedia.org/wiki/Robots.txt]: "each protocol and
> port needs its own robots.txt file"
> * [Google
> webmasters|https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt]:
> "The directives listed in the robots.txt file apply only to the host,
> protocol and port number where the file is hosted."