[
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17772536#comment-17772536
]
ASF GitHub Bot commented on NUTCH-2990:
---------------------------------------
jnioche commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1750429119
Since you asked me to have a look at it @sebastian-nagel, it looks good to
me!
> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> -------------------------------------------------------------------
>
> Key: NUTCH-2990
> URL: https://issues.apache.org/jira/browse/NUTCH-2990
> Project: Nutch
> Issue Type: Improvement
> Components: protocol, robots
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> The robots.txt parser
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
> follows only one redirect when fetching the robots.txt, while RFC 9309
> recommends following at least five redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers
> SHOULD follow at least five consecutive redirects, even across authorities
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the
> robots.txt file MUST be fetched, parsed, and its rules followed in the
> context of the initial authority. If there are more than five consecutive
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect
> location is itself a "/robots.txt" on a different host and, if so, try to
> read it from the cache first.
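>
> As a rough sketch of the intended behavior (not the actual
> HttpRobotRulesParser implementation, which builds on Nutch's protocol
> plugins and uses crawler-commons for parsing), a manual redirect loop
> following RFC 9309 could look like the code below; the class and method
> names are hypothetical:
> {code:java}
> import java.io.IOException;
> import java.io.InputStream;
> import java.net.HttpURLConnection;
> import java.net.URL;
>
> public class RobotsRedirectFetcher {
>
>   /** Maximum number of consecutive redirects (RFC 9309, 2.3.1.2). */
>   private static final int MAX_REDIRECTS = 5;
>
>   /**
>    * Fetch a robots.txt, following up to MAX_REDIRECTS consecutive
>    * redirects, even across hosts. Returns null if the robots.txt
>    * is to be treated as unavailable.
>    */
>   public static byte[] fetchRobotsTxt(URL robotsUrl) throws IOException {
>     URL url = robotsUrl;
>     // initial fetch plus up to MAX_REDIRECTS redirected fetches
>     for (int i = 0; i <= MAX_REDIRECTS; i++) {
>       HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>       conn.setInstanceFollowRedirects(false); // handle redirects manually
>       int status = conn.getResponseCode();
>       if (status >= 300 && status < 400) {
>         String location = conn.getHeaderField("Location");
>         if (location == null) {
>           return null;
>         }
>         // resolve relative Location headers against the current URL
>         url = new URL(url, location);
>         // here the parser could check whether url points to
>         // "/robots.txt" on a different host and consult its
>         // per-host cache before re-fetching
>         continue;
>       }
>       if (status == HttpURLConnection.HTTP_OK) {
>         try (InputStream in = conn.getInputStream()) {
>           // rules are applied in the context of the initial authority
>           return in.readAllBytes();
>         }
>       }
>       return null; // other status codes are handled elsewhere
>     }
>     // more than five consecutive redirects: MAY assume unavailable
>     return null;
>   }
> }
> {code}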