[
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772536#comment-17772536
]
ASF GitHub Bot commented on NUTCH-2990:
---------------------------------------
jnioche commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1750429119
Since you asked me to have a look at it @sebastian-nagel, it looks good to
me!
> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> -------------------------------------------------------------------
>
> Key: NUTCH-2990
> URL: https://issues.apache.org/jira/browse/NUTCH-2990
> Project: Nutch
> Issue Type: Improvement
> Components: protocol, robots
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> The robots.txt parser
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
> follows only one redirect when fetching the robots.txt, while RFC 9309
> recommends following at least five redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
> SHOULD follow at least five consecutive redirects, even across authorities
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the
> robots.txt file MUST be fetched, parsed, and its rules followed in the
> context of the initial authority. If there are more than five consecutive
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect
> location is itself a "/robots.txt" on a different host and, if so, try to
> read it from the cache first.
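>
> As a rough sketch of the intended behavior (not the actual
> HttpRobotRulesParser implementation, which builds on Nutch's protocol
> plugins and uses crawler-commons for parsing), a manual redirect loop
> following RFC 9309 could look like the code below; the class and method
> names are hypothetical:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
>
> public class RobotsRedirectFetcher {
>
>   /** Maximum number of consecutive redirects (RFC 9309, 2.3.1.2). */
>   private static final int MAX_REDIRECTS = 5;
>
>   /**
>    * Fetch a robots.txt, following up to MAX_REDIRECTS consecutive
>    * redirects, even across hosts. Returns null if the robots.txt
>    * is to be treated as unavailable.
>    */
>   public static byte[] fetchRobotsTxt(URL robotsUrl) throws IOException {
>     URL url = robotsUrl;
>     // initial fetch plus up to MAX_REDIRECTS redirected fetches
>     for (int i = 0; i <= MAX_REDIRECTS; i++) {
>       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>       conn.setInstanceFollowRedirects(false); // handle redirects manually
>       int status = conn.getResponseCode();
>       if (status >= 300 && status < 400) {
>         String location = conn.getHeaderField("Location");
>         if (location == null) {
>           return null;
>         }
>         // resolve relative Location headers against the current URL
>         url = new URL(url, location);
>         // here the parser could check whether url points to
>         // "/robots.txt" on a different host and consult its
>         // per-host cache before re-fetching
>         continue;
>       }
>       if (status == HttpURLConnection.HTTP_OK) {
>         try (InputStream in = conn.getInputStream()) {
>           // rules are applied in the context of the initial authority
>           return in.readAllBytes();
>         }
>       }
>       return null; // other status codes are handled elsewhere
>     }
>     // more than five consecutive redirects: MAY assume unavailable
>     return null;
>   }
> }
> {code}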