[
https://issues.apache.org/jira/browse/NUTCH-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2573:
-----------------------------------
Description:
Fetcher should optionally (enabled by default) suspend crawling for a configurable
interval when fetching the robots.txt fails with a server error (HTTP status
code 5xx, esp. 503), following [Google's spec|
https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
??5xx (server error)??
??Server errors are seen as temporary errors that result in a "full disallow"
of crawling. The request is retried until a non-server-error HTTP result code
is obtained. A 503 (Service Unavailable) error will result in fairly frequent
retrying. To temporarily suspend crawling, it is recommended to serve a 503
HTTP result code. Handling of a permanent server error is undefined.??
See also the [draft robots.txt RFC, section "Unreachable
status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
Crawler-commons robots rules already provide
[isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
to store this information (must be set from RobotRulesParser).
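A minimal sketch of what this could look like, assuming the robots.txt fetch already reports its HTTP status code to RobotRulesParser. Only the crawler-commons calls ({{SimpleRobotRules}}, {{setDeferVisits}}, {{isDeferVisits}}) are existing API; the helper method and the property name mentioned in the comments are hypothetical:
{code:java}
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class RobotsDeferSketch {

  /**
   * Hypothetical helper: map the HTTP status of a failed robots.txt fetch
   * to robots rules. On 5xx the rules disallow everything and flag
   * deferVisits, so the fetcher can suspend the queue instead of retrying.
   */
  public static BaseRobotRules rulesFromHttpStatus(int statusCode) {
    if (statusCode >= 500 && statusCode < 600) {
      SimpleRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_NONE);
      rules.setDeferVisits(true); // "temporary full disallow, retry later"
      return rules;
    }
    // other failures (e.g. 404): allow all, as before
    return new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
  }

  public static void main(String[] args) {
    BaseRobotRules rules = rulesFromHttpStatus(503);
    if (rules.isDeferVisits()) {
      // Fetcher side (sketch): delay the whole host queue for a configurable
      // interval (e.g. a new property such as "fetcher.robotstxt.defer.interval",
      // name hypothetical) instead of fetching any URL from this host now.
      System.out.println("robots.txt 5xx: defer visits for this host");
    }
  }
}
{code}
Both the defer interval and whether the suspension is applied at all should be configurable, with suspension enabled by default as described above.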
was:
Fetcher should optionally (by default) suspend crawling by a configurable
interval when fetching the robots.txt fails with a server errors (HTTP status
code 5xx, esp. 503) following [Google's spec|
https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
??5xx (server error)??
??Server errors are seen as temporary errors that result in a "full disallow"
of crawling. The request is retried until a non-server-error HTTP result code
is obtained. A 503 (Service Unavailable) error will result in fairly frequent
retrying. To temporarily suspend crawling, it is recommended to serve a 503
HTTP result code. Handling of a permanent server error is undefined.??
Crawler-commons robots rules already provide
[isDeverVisitis|http://crawler-commons.github.io/crawler-commons/0.9/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
to store this information (must be set from RobotRulesParser).
> Suspend crawling if robots.txt fails to fetch with 5xx status
> -------------------------------------------------------------
>
> Key: NUTCH-2573
> URL: https://issues.apache.org/jira/browse/NUTCH-2573
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.14
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.19
>
>
> Fetcher should optionally (enabled by default) suspend crawling for a configurable
> interval when fetching the robots.txt fails with a server error (HTTP status
> code 5xx, esp. 503), following [Google's spec|
> https://developers.google.com/search/reference/robots_txt#handling-http-result-codes]:
> ??5xx (server error)??
> ??Server errors are seen as temporary errors that result in a "full disallow"
> of crawling. The request is retried until a non-server-error HTTP result code
> is obtained. A 503 (Service Unavailable) error will result in fairly frequent
> retrying. To temporarily suspend crawling, it is recommended to serve a 503
> HTTP result code. Handling of a permanent server error is undefined.??
> See also the [draft robots.txt RFC, section "Unreachable
> status"|https://datatracker.ietf.org/doc/html/draft-koster-rep-06#section-2.3.1.4].
> Crawler-commons robots rules already provide
> [isDeferVisits|https://crawler-commons.github.io/crawler-commons/1.2/crawlercommons/robots/BaseRobotRules.html#isDeferVisits--]
> to store this information (must be set from RobotRulesParser).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)