[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769045#comment-17769045 ]
ASF GitHub Bot commented on NUTCH-2990:
---------------------------------------

sebastian-nagel opened a new pull request, #779:
URL: https://github.com/apache/nutch/pull/779

- follow multiple redirects when fetching robots.txt
- the number of followed redirects is configurable via the property `http.robots.redirect.max` (default: 5)
- improvements in RobotRulesParser's robots.txt test utility
- bug fix: the passed agent names need to be transferred to the property `http.robots.agents` earlier, before the protocol plugins are configured
- more verbose debug logging


> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> -------------------------------------------------------------------
>
>                 Key: NUTCH-2990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, robots
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The robots.txt parser
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
> follows only one redirect when fetching the robots.txt, while the
> robots.txt RFC 9309 recommends following 5 redirects:
> {quote}
> 2.3.1.2. Redirects
>
> It's possible that a server responds to a robots.txt fetch request with a
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
> SHOULD follow at least five consecutive redirects, even across authorities
> (for example, hosts in the case of HTTP).
>
> If a robots.txt file is reached within five consecutive redirects, the
> robots.txt file MUST be fetched, parsed, and its rules followed in the
> context of the initial authority. If there are more than five consecutive
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects)
> {quote}
> While following redirects, the parser should check whether the redirect
> location is itself a "/robots.txt" on a different host and then try to
> read it from the cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
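The redirect limit described above can be sketched as follows. This is an illustrative model only, not Nutch's HttpRobotRulesParser code: the class and method names are hypothetical, and the redirect chain is represented as a plain map (URL to redirect target) so the RFC 9309 rule is demonstrable without network access.

```java
import java.util.Map;

/**
 * Sketch of the redirect policy from RFC 9309, section 2.3.1.2:
 * follow up to five consecutive redirects when resolving robots.txt,
 * even across hosts; beyond that, treat robots.txt as unavailable.
 */
public class RobotsRedirectSketch {

    // cf. the http.robots.redirect.max property added by PR #779 (default: 5)
    static final int MAX_REDIRECTS = 5;

    /**
     * Follows the redirect chain starting at startUrl.
     *
     * @param redirects map from a URL to the URL it redirects to;
     *                  a URL absent from the map serves content directly
     * @return the final (non-redirecting) URL, or null if more than
     *         MAX_REDIRECTS consecutive redirects occur
     */
    static String resolveRobotsUrl(Map<String, String> redirects, String startUrl) {
        String url = startUrl;
        for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
            String target = redirects.get(url);
            if (target == null) {
                return url; // reached a URL that serves robots.txt directly
            }
            url = target; // follow the redirect, even to a different host
        }
        return null; // more than five consecutive redirects: unavailable
    }
}
```

A chain of exactly five redirects still resolves, while a sixth consecutive redirect makes the method return null, matching the RFC's MUST/MAY wording quoted above.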