Sebastian Nagel created NUTCH-2990:
--------------------------------------
Summary: HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
Key: NUTCH-2990
URL: https://issues.apache.org/jira/browse/NUTCH-2990
Project: Nutch
Issue Type: Improvement
Components: protocol, robots
Affects Versions: 1.19
Reporter: Sebastian Nagel
Fix For: 1.20
The robots.txt parser
([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
follows only a single redirect when fetching the robots.txt file, whereas
RFC 9309 recommends following at least five redirects:
{quote}
2.3.1.2. Redirects

It's possible that a server responds to a robots.txt fetch request with a
redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers SHOULD
follow at least five consecutive redirects, even across authorities (for
example, hosts in the case of HTTP).

If a robots.txt file is reached within five consecutive redirects, the
robots.txt file MUST be fetched, parsed, and its rules followed in the context
of the initial authority. If there are more than five consecutive redirects,
crawlers MAY assume that the robots.txt file is unavailable.
(https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
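For illustration, here is a minimal, self-contained sketch of the intended
redirect loop. It is written against plain {{java.net.HttpURLConnection}}
rather than Nutch's protocol plugins (the actual implementation would go
through {{HttpBase.getResponse()}}); the class and method names below are
made up for this sketch:

{code:java}
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RobotsRedirectSketch {

  // RFC 9309, 2.3.1.2: crawlers SHOULD follow at least five consecutive redirects
  private static final int MAX_REDIRECTS = 5;

  /**
   * Fetches robots.txt, following up to MAX_REDIRECTS consecutive redirects,
   * possibly across hosts. Returns null if the file should be treated as
   * unavailable (too many redirects, malformed redirect, or non-200 status).
   */
  static byte[] fetchRobotsTxt(URL robotsUrl) throws IOException {
    URL target = robotsUrl;
    // iteration 0 is the initial request, iterations 1..5 follow redirects
    for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
      HttpURLConnection conn = (HttpURLConnection) target.openConnection();
      conn.setInstanceFollowRedirects(false); // count the redirects ourselves
      int code = conn.getResponseCode();
      if (code == 301 || code == 302 || code == 303 || code == 307 || code == 308) {
        String location = conn.getHeaderField("Location");
        if (location == null) {
          return null; // malformed redirect: no Location header
        }
        // redirects may cross authorities (hosts), per RFC 9309
        target = new URL(target, location);
        continue;
      }
      if (code == HttpURLConnection.HTTP_OK) {
        try (InputStream in = conn.getInputStream()) {
          return in.readAllBytes();
        }
      }
      // other status codes (403, 404, 5xx) are handled by the existing
      // allow/deny logic in HttpRobotRulesParser, not shown here
      return null;
    }
    // more than five consecutive redirects: RFC 9309 says crawlers MAY
    // assume that the robots.txt file is unavailable
    return null;
  }
}
{code}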
While following redirects, the parser should also check whether the redirect
location is itself "/robots.txt" on a different host; if so, it should first
try to read the rules for that host from the cache before fetching.
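
A sketch of the proposed cache check follows. The "protocol:host:port" key
format mirrors the one used by {{RobotRulesParser}}, and {{BaseRobotRules}}
comes from crawler-commons, which Nutch already uses; the class, map, and
helper method below are illustrative stand-ins, not the actual Nutch code:

{code:java}
import java.net.URL;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import crawlercommons.robots.BaseRobotRules;

public class RobotsCacheSketch {

  // stand-in for the static robots rules cache in RobotRulesParser
  private static final Map<String, BaseRobotRules> CACHE = new ConcurrentHashMap<>();

  /** Cache key in the "protocol:host:port" form used by Nutch's robots cache. */
  static String getCacheKey(URL url) {
    int port = (url.getPort() == -1) ? url.getDefaultPort() : url.getPort();
    return url.getProtocol().toLowerCase() + ":"
        + url.getHost().toLowerCase() + ":" + port;
  }

  /**
   * If the redirect target is itself "/robots.txt" on a different host and
   * its rules are already cached, reuse them and register them for the
   * initial authority as well (RFC 9309: the rules are followed in the
   * context of the initial authority). Returns null if nothing cached applies.
   */
  static BaseRobotRules reuseCachedRules(URL initialRobotsUrl, URL redirectTarget) {
    if ("/robots.txt".equals(redirectTarget.getPath())
        && !redirectTarget.getHost().equalsIgnoreCase(initialRobotsUrl.getHost())) {
      BaseRobotRules cached = CACHE.get(getCacheKey(redirectTarget));
      if (cached != null) {
        CACHE.put(getCacheKey(initialRobotsUrl), cached);
        return cached;
      }
    }
    return null; // not cached: the caller continues fetching the redirect target
  }
}
{code}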
--
This message was sent by Atlassian Jira
(v8.20.10#820010)