[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769293#comment-17769293 ]
ASF GitHub Bot commented on NUTCH-2990:
---------------------------------------

sebastian-nagel commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1735968193

   > an example on hand of a robots.txt which can be fetched with >1 redirects?

   http://wikipedia.org/robots.txt

   Note: this works with protocol-http; for protocol-okhttp, the fix for NUTCH-3002 also needs to be applied.

   Maybe as an additional note: this PR removes the secondary lookup for a lower-cased "location" header. Case-insensitive lookup of protocol metadata should be implemented at the protocol level.

> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> -------------------------------------------------------------------
>
>                 Key: NUTCH-2990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, robots
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The robots.txt parser
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
> follows only one redirect when fetching the robots.txt, while the robots.txt
> RFC 9309 recommends following at least 5 redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
> SHOULD follow at least five consecutive redirects, even across authorities
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the
> robots.txt file MUST be fetched, parsed, and its rules followed in the
> context of the initial authority. If there are more than five consecutive
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect
> location is itself a "/robots.txt" on a different host and then try to read
> it from the cache.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
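
The redirect handling described in the issue could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR and not the Nutch API: the class `RobotsRedirectSketch`, the `Response` holder, the injected `fetcher` function, and the cache keyed by scheme and authority are all hypothetical names introduced here.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch; none of these names come from Nutch.
class RobotsRedirectSketch {

  /** Assumed minimal response: status code, Location header (or null), body. */
  static final class Response {
    final int status;
    final String location;
    final String body;

    Response(int status, String location, String body) {
      this.status = status;
      this.location = location;
      this.body = body;
    }
  }

  /** RFC 9309 says crawlers SHOULD follow at least five consecutive redirects. */
  static final int MAX_REDIRECTS = 5;

  // Assumed cache layout: "scheme://authority" -> robots.txt content.
  private final Map<String, String> cache = new HashMap<>();
  private final Function<URI, Response> fetcher;

  RobotsRedirectSketch(Function<URI, Response> fetcher) {
    this.fetcher = fetcher;
  }

  static String cacheKey(URI u) {
    return u.getScheme() + "://" + u.getAuthority();
  }

  /** Returns the robots.txt content, or empty if it is treated as unavailable. */
  Optional<String> fetchRobots(URI robotsUrl) {
    String initialKey = cacheKey(robotsUrl);
    URI current = robotsUrl;
    for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
      Response r = fetcher.apply(current);
      boolean redirect = r.status >= 300 && r.status < 400 && r.location != null;
      if (!redirect) {
        if (r.status == 200) {
          // Rules apply in the context of the initial authority.
          cache.put(initialKey, r.body);
          return Optional.of(r.body);
        }
        return Optional.empty();
      }
      if (hops == MAX_REDIRECTS) {
        break; // more than five consecutive redirects
      }
      URI target = current.resolve(r.location);
      String targetKey = cacheKey(target);
      // If the target is a /robots.txt on a different host, try the cache first.
      if ("/robots.txt".equals(target.getPath()) && !targetKey.equals(cacheKey(current))) {
        String cached = cache.get(targetKey);
        if (cached != null) {
          cache.put(initialKey, cached);
          return Optional.of(cached);
        }
      }
      current = target;
    }
    return Optional.empty();
  }
}
```

Passing the fetcher in as a function keeps the sketch independent of any particular protocol plugin; a real implementation would also have to decide how non-200, non-redirect responses map to allow/deny rules, which is left out here.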