[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

ASF GitHub Bot (Jira) Tue, 26 Sep 2023 10:20:08 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17769293#comment-17769293
 ]


ASF GitHub Bot commented on NUTCH-2990:
---------------------------------------

sebastian-nagel commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1735968193

   >  an example on hand of a robots.txt which can be fetched with >1 redirects?
   
   http://wikipedia.org/robots.txt
   
   Note: this works with protocol-http; for protocol-okhttp, the fix for 
NUTCH-3002 also needs to be applied.
   
   An additional note: this PR removes the secondary lookup for a 
lower-cased "location" header. Case-insensitive lookup of protocol metadata 
should be implemented at the protocol level.
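
To illustrate the point about case-insensitive lookup, here is a minimal sketch of a header store that resolves names regardless of case. The class `CaseInsensitiveHeaders` and its methods are hypothetical and are not part of Nutch or of this PR; the sketch only shows one way a protocol layer could avoid a second lookup for a lower-cased "location".

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration only, not a Nutch class.
public class CaseInsensitiveHeaders {

    // A TreeMap ordered by String.CASE_INSENSITIVE_ORDER treats
    // "Location", "location" and "LOCATION" as the same key.
    private final Map<String, String> headers =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    public void set(String name, String value) {
        headers.put(name, value);
    }

    public String get(String name) {
        return headers.get(name);
    }

    public static void main(String[] args) {
        CaseInsensitiveHeaders h = new CaseInsensitiveHeaders();
        // The server sent the header name in lower case.
        h.set("location", "https://www.example.org/robots.txt");
        // The caller asks with the conventional capitalization.
        System.out.println(h.get("Location"));
    }
}
```

With a store like this, the caller needs a single lookup, whatever capitalization the server used.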




> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> -------------------------------------------------------------------
>
>                 Key: NUTCH-2990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, robots
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The robots.txt parser 
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
>  follows only one redirect when fetching the robots.txt, while the robots.txt 
> RFC 9309 recommends following at least five redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a 
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers 
> SHOULD follow at least five consecutive redirects, even across authorities 
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the 
> robots.txt file MUST be fetched, parsed, and its rules followed in the 
> context of the initial authority. If there are more than five consecutive 
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect 
> location is itself a "/robots.txt" on a different host and then try to read 
> it from the cache.
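
The behavior described in the issue can be sketched as follows. This is a simplified illustration under stated assumptions, not the code of HttpRobotRulesParser: the class `RobotsRedirectSketch`, the `Response` type, the injected `fetcher` function, and the cache keyed by scheme and authority are all hypothetical. It follows at most five consecutive redirects, consults the cache when a redirect target is itself a "/robots.txt", and stores the result under the initial authority.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch, not the Nutch implementation.
public class RobotsRedirectSketch {

    static final int MAX_REDIRECTS = 5; // limit taken from RFC 9309, section 2.3.1.2

    /** Minimal response model: status code, Location header (may be null), body. */
    static final class Response {
        final int status;
        final String location;
        final String body;

        Response(int status, String location, String body) {
            this.status = status;
            this.location = location;
            this.body = body;
        }
    }

    // Cache of robots.txt content, keyed by "scheme://authority" (an assumption).
    private final Map<String, String> cache = new HashMap<>();
    private final Function<URI, Response> fetcher;

    RobotsRedirectSketch(Function<URI, Response> fetcher) {
        this.fetcher = fetcher;
    }

    static String cacheKey(URI u) {
        return u.getScheme() + "://" + u.getAuthority();
    }

    /** Returns the robots.txt content, or null if it is treated as unavailable. */
    String fetchRobots(URI start) {
        URI current = start;
        String result = null;
        // hops == 0 is the initial request; hops 1..5 are redirect targets.
        for (int hops = 0; hops <= MAX_REDIRECTS; hops++) {
            // A redirect target that is itself a /robots.txt may already be cached.
            if (hops > 0 && "/robots.txt".equals(current.getPath())) {
                String cached = cache.get(cacheKey(current));
                if (cached != null) {
                    result = cached;
                    break;
                }
            }
            Response r = fetcher.apply(current);
            if (r.status >= 300 && r.status < 400 && r.location != null) {
                // Resolve relative Location values against the current URL.
                current = current.resolve(r.location);
                continue;
            }
            if (r.status == 200) {
                result = r.body;
            }
            break;
        }
        // Rules apply in the context of the initial authority.
        if (result != null) {
            cache.put(cacheKey(start), result);
        }
        return result;
    }

    public static void main(String[] args) {
        // Simulated redirect chain with made-up hosts; no network access.
        Map<String, Response> pages = new HashMap<>();
        pages.put("http://a.example/robots.txt",
                new Response(301, "https://a.example/robots.txt", null));
        pages.put("https://a.example/robots.txt",
                new Response(301, "https://www.a.example/robots.txt", null));
        pages.put("https://www.a.example/robots.txt",
                new Response(200, null, "User-agent: *\nDisallow: /private"));

        RobotsRedirectSketch sketch = new RobotsRedirectSketch(
                u -> pages.getOrDefault(u.toString(), new Response(404, null, null)));
        System.out.println(sketch.fetchRobots(URI.create("http://a.example/robots.txt")));
    }
}
```

Injecting the fetcher keeps the sketch runnable without network access; a real parser would issue HTTP requests through the configured protocol plugin instead.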



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
