[ https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2573.
------------------------------------
    Resolution: Implemented

> Suspend crawling if robots.txt fails to fetch with 5xx status
> --------------------------------------------------------------
>
>                 Key: NUTCH-2573
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2573
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.19
>
>
> Fetcher should optionally (by default) suspend crawling for a configurable
> interval when fetching the robots.txt fails with a server error (HTTP status
> code 5xx, esp. 503), following [Google's spec|https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow"
> of crawling. The request is retried until a non-server-error HTTP result code
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent
> retrying. To temporarily suspend crawling, it is recommended to serve a 503
> HTTP result code. Handling of a permanent server error is undefined.??
> See also the [draft robots.txt RFC, section "Unreachable status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
> Crawler-commons robots rules already provide
> [isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
> to store this information (must be set from RobotRulesParser).

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
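For illustration, below is a minimal sketch (not the actual NUTCH-2573 patch) of how a robots.txt fetch status could be mapped onto crawler-commons rules so that a 5xx response produces a temporary "full disallow" with the defer-visits flag set. Only BaseRobotRules, SimpleRobotRules, setDeferVisits and isDeferVisits are real crawler-commons API; the class and method names of the sketch itself are hypothetical.

{code:java}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

/**
 * Illustrative mapping of a robots.txt HTTP result code to robots rules,
 * following the handling quoted above. Not the actual Nutch implementation.
 */
public class RobotsStatusSketch {

  /**
   * @param httpStatus status code returned for the robots.txt request
   * @return rules whose isDeferVisits() flag tells the fetcher to suspend
   *         crawling of the corresponding host/queue and retry later
   */
  public static BaseRobotRules rulesForStatus(int httpStatus) {
    if (httpStatus >= 500) {
      // 5xx: temporary error => "full disallow" and defer further visits
      // until a non-server-error result code is obtained
      SimpleRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
      rules.setDeferVisits(true);
      return rules;
    }
    // 4xx: treated here as "no restrictions" for simplicity
    // (a successful 2xx fetch would be parsed into ALLOW_SOME rules instead)
    return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
  }

  /** Fetcher-side check: suspend the fetch queue if visits are deferred. */
  public static boolean shouldSuspendQueue(BaseRobotRules rules) {
    return rules.isDeferVisits();
  }
}
{code}

In the fetcher loop, a queue whose robots rules report isDeferVisits() could then be delayed by a configurable interval and the item retried, rather than fetched or dropped, which is the behavior described in the issue.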