[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477379#comment-17477379 ]

ASF GitHub Bot commented on NUTCH-2573:
---------------------------------------

lewismc commented on pull request #724:
URL: https://github.com/apache/nutch/pull/724#issuecomment-1014841066


   Looks like it failed on Javadoc generation @sebastian-nagel 




> Suspend crawling if robots.txt fails to fetch with 5xx status
> -------------------------------------------------------------
>
>                 Key: NUTCH-2573
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2573
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> Fetcher should optionally (enabled by default) suspend crawling for a
> configurable interval when fetching the robots.txt fails with a server error
> (HTTP status code 5xx, esp. 503), following
> [Google's spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow" 
> of crawling. The request is retried until a non-server-error HTTP result code 
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent 
> retrying. To temporarily suspend crawling, it is recommended to serve a 503 
> HTTP result code. Handling of a permanent server error is undefined.??
> See also the [draft robots.txt RFC, section "Unreachable 
> status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
> Crawler-commons robots rules already provide
> [isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
> to store this information; the flag must be set from RobotRulesParser, as in
> the sketch below.
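>
> A minimal sketch of both sides, using the crawler-commons API linked above.
> The class, the Response holder, and the method and parameter names are
> illustrative assumptions, not the actual Nutch patch; the defer interval
> would come from a new configuration property:
> {code:java}
> import crawlercommons.robots.BaseRobotRules;
> import crawlercommons.robots.SimpleRobotRules;
> import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
>
> public class RobotsDeferSketch {
>
>   /** Hypothetical holder for a robots.txt fetch result. */
>   static class Response {
>     int status;     // HTTP status code
>     byte[] content; // robots.txt payload, if any
>   }
>
>   /**
>    * Parser side (would live in RobotRulesParser): on a 5xx status,
>    * return "disallow all" rules and mark them as deferred so the
>    * fetcher knows the error is temporary and should retry later.
>    */
>   static BaseRobotRules handleRobotsResponse(Response r) {
>     if (r.status >= 500 && r.status < 600) {
>       SimpleRobotRules rules =
>           new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
>       rules.setDeferVisits(true); // "full disallow", but temporary
>       return rules;
>     }
>     // 2xx: parse r.content as usual; 4xx: allow all. (Elided.)
>     return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
>   }
>
>   /**
>    * Fetcher side: before fetching from a host's queue, check the
>    * deferred flag and suspend that queue for the configured interval.
>    */
>   static long nextFetchDelayMs(BaseRobotRules rules, long deferIntervalMs) {
>     return rules.isDeferVisits() ? deferIntervalMs : 0L;
>   }
> }
> {code}
> In Nutch itself the delay would be applied to the fetcher's per-host queue,
> with the interval (and an on/off switch, enabled by default) exposed as
> configuration.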



