[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769045#comment-17769045 ]
ASF GitHub Bot commented on NUTCH-2990:
---------------------------------------

sebastian-nagel opened a new pull request, #779:
URL: https://github.com/apache/nutch/pull/779

- follow multiple redirects when fetching robots.txt
- the number of followed redirects is configurable via the property `http.robots.redirect.max` (default: 5)
- improvements in RobotRulesParser's robots.txt test utility
- bug fix: the passed agent names need to be transferred to the property `http.robots.agents` earlier, before the protocol plugins are configured
- more verbose debug logging


> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> -------------------------------------------------------------------
>
>                 Key: NUTCH-2990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, robots
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The robots.txt parser
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
> follows only one redirect when fetching the robots.txt, while the
> robots.txt RFC 9309 recommends following 5 redirects:
> {quote}
> 2.3.1.2. Redirects
>
> It's possible that a server responds to a robots.txt fetch request with a
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
> SHOULD follow at least five consecutive redirects, even across authorities
> (for example, hosts in the case of HTTP).
>
> If a robots.txt file is reached within five consecutive redirects, the
> robots.txt file MUST be fetched, parsed, and its rules followed in the
> context of the initial authority. If there are more than five consecutive
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects)
> {quote}
> While following redirects, the parser should check whether the redirect
> location is itself a "/robots.txt" on a different host and then try to
> read it from the cache.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
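The redirect limit described above can be sketched as follows. This is an illustrative model only, not Nutch's HttpRobotRulesParser code: the class and method names are hypothetical, and the redirect chain is represented as a plain map (URL to redirect target) so the RFC 9309 rule is demonstrable without network access.

```java
import java.util.Map;

/**
 * Sketch of the redirect policy from RFC 9309, section 2.3.1.2:
 * follow up to five consecutive redirects when resolving robots.txt,
 * even across hosts; beyond that, treat robots.txt as unavailable.
 */
public class RobotsRedirectSketch {

    // cf. the http.robots.redirect.max property added by PR #779 (default: 5)
    static final int MAX_REDIRECTS = 5;

    /**
     * Follows the redirect chain starting at startUrl.
     *
     * @param redirects map from a URL to the URL it redirects to;
     *                  a URL absent from the map serves content directly
     * @return the final (non-redirecting) URL, or null if more than
     *         MAX_REDIRECTS consecutive redirects occur
     */
    static String resolveRobotsUrl(Map<String, String> redirects, String startUrl) {
        String url = startUrl;
        for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
            String target = redirects.get(url);
            if (target == null) {
                return url; // reached a URL that serves robots.txt directly
            }
            url = target; // follow the redirect, even to a different host
        }
        return null; // more than five consecutive redirects: unavailable
    }
}
```

A chain of exactly five redirects still resolves, while a sixth consecutive redirect makes the method return null, matching the RFC's MUST/MAY wording quoted above.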