Sebastian Nagel created NUTCH-2990:
--------------------------------------

             Summary: HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
                 Key: NUTCH-2990
                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
             Project: Nutch
          Issue Type: Improvement
          Components: protocol, robots
    Affects Versions: 1.19
            Reporter: Sebastian Nagel
             Fix For: 1.20


The robots.txt parser ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html]) follows only a single redirect when fetching the robots.txt file, whereas the robots.txt RFC 9309 recommends following at least five consecutive redirects:

{quote}2.3.1.2. Redirects

It's possible that a server responds to a robots.txt fetch request with a redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers SHOULD follow at least five consecutive redirects, even across authorities (for example, hosts in the case of HTTP).

If a robots.txt file is reached within five consecutive redirects, the robots.txt file MUST be fetched, parsed, and its rules followed in the context of the initial authority.

If there are more than five consecutive redirects, crawlers MAY assume that the robots.txt file is unavailable.

(https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
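For illustration, a minimal sketch of the intended redirect handling, written against plain java.net.http rather than Nutch's protocol plugins (the class and method names are hypothetical, not the actual HttpRobotRulesParser API):

{code:java}
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsRedirectFetcher {

  private static final int MAX_REDIRECTS = 5; // RFC 9309, section 2.3.1.2

  /**
   * Fetches robots.txt, manually following up to five consecutive
   * redirects, even across authorities (hosts). Returns the body, or
   * null if the chain exceeds five redirects (the crawler MAY then
   * treat the robots.txt as unavailable).
   */
  public static String fetchRobotsTxt(URI robotsUri) throws Exception {
    HttpClient client = HttpClient.newBuilder()
        .followRedirects(HttpClient.Redirect.NEVER) // handle redirects ourselves
        .build();

    URI current = robotsUri;
    // hop 0 is the initial request, hops 1..5 follow redirects
    for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
      HttpRequest request = HttpRequest.newBuilder(current).GET().build();
      HttpResponse<String> response =
          client.send(request, HttpResponse.BodyHandlers.ofString());

      int status = response.statusCode();
      if (status / 100 == 3) {
        // Resolve the Location header relative to the current URL.
        String location = response.headers().firstValue("Location").orElse(null);
        if (location == null) {
          return null; // malformed redirect
        }
        current = current.resolve(location);
        continue;
      }
      if (status == 200) {
        // The rules are applied in the context of the *initial*
        // authority, wherever the redirect chain ended.
        return response.body();
      }
      return null; // 4xx/5xx: left to the caller's policy
    }
    return null; // more than five consecutive redirects
  }
}
{code}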

While following redirects, the parser should also check whether the redirect location is itself a "/robots.txt" on a different host and, if so, first try to read the rules for that host from the cache before fetching them again.
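A rough sketch of that cache lookup; the "protocol:host:port" key format and the RobotRules placeholder type are assumptions for illustration, not Nutch's actual internals:

{code:java}
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RobotsCacheLookup {

  /** Placeholder for the parsed rules (e.g. crawler-commons' BaseRobotRules). */
  static class RobotRules { }

  // Hypothetical cache: one entry per authority, keyed by "protocol:host:port".
  static final Map<String, RobotRules> CACHE = new ConcurrentHashMap<>();

  /** Builds the cache key for an authority, defaulting the port from the scheme. */
  static String cacheKey(URI uri) {
    int port = uri.getPort() >= 0 ? uri.getPort()
        : ("https".equals(uri.getScheme()) ? 443 : 80);
    return uri.getScheme() + ":" + uri.getHost() + ":" + port;
  }

  /**
   * If a redirect points at "/robots.txt" on a different authority, the
   * rules for that authority may already be cached from an earlier fetch;
   * return them instead of fetching again. Returns null on a cache miss.
   */
  static RobotRules lookupRedirectTarget(URI redirectTarget) {
    if ("/robots.txt".equals(redirectTarget.getPath())) {
      return CACHE.get(cacheKey(redirectTarget));
    }
    return null;
  }
}
{code}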


