[jira] [Commented] (NUTCH-2754) fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.

Hudson (Jira) Mon, 23 Dec 2019 04:05:09 -0800


    [ 
https://issues.apache.org/jira/browse/NUTCH-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002237#comment-17002237
 ]


Hudson commented on NUTCH-2754:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3658 (See 
[https://builds.apache.org/job/Nutch-trunk/3658/])
NUTCH-2754 fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 
(sebastian: 
[https://github.com/apache/nutch/commit/4c74bcece7f743a4ec008550f709c259317c5aa4])
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java


> fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
> --------------------------------------------------------------
>
>                 Key: NUTCH-2754
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2754
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, robots
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> Sites specifying a Crawl-Delay of more than 5 minutes (301 seconds or more) 
> are always ignored, even if fetcher.max.crawl.delay is set to a higher value.
> We need to pass a higher value of fetcher.max.crawl.delay on to 
> [crawler-commons' robots.txt 
> parser|https://github.com/crawler-commons/crawler-commons/blob/c9c0ac6eda91b13d534e69f6da3fd15065414fb0/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java#L78];
> otherwise the parser falls back to its internal default of 300 sec. and 
> disallows all sites specifying a longer Crawl-Delay in their robots.txt.
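
The behaviour described above can be sketched roughly as follows. This is a
self-contained illustration, not the code from the commit and not the
crawler-commons API: the class CrawlDelaySketch and both of its methods are
hypothetical names. It assumes that fetcher.max.crawl.delay is given in
seconds, that the parser works in milliseconds, and that a negative setting
means "no limit".

```java
public class CrawlDelaySketch {

    // The parser's internal default cap described in the issue: 300 sec., in ms.
    static final long PARSER_DEFAULT_MAX_CRAWL_DELAY_MS = 300_000L;

    /**
     * Hypothetical helper: the cap (in ms) that would be handed to the parser
     * for a given fetcher.max.crawl.delay (in seconds).
     * Assumption: a negative value means "never skip", modelled as no cap.
     */
    static long parserCapFor(long fetcherMaxCrawlDelaySec) {
        if (fetcherMaxCrawlDelaySec < 0) {
            return Long.MAX_VALUE;
        }
        return fetcherMaxCrawlDelaySec * 1000L;
    }

    /**
     * Hypothetical model of the parser's decision: a site is disallowed
     * when its Crawl-Delay exceeds the cap the parser was given.
     */
    static boolean isDisallowedByDelay(long crawlDelayMs, long parserCapMs) {
        return crawlDelayMs > parserCapMs;
    }
}
```

With the default cap, isDisallowedByDelay(600_000L,
PARSER_DEFAULT_MAX_CRAWL_DELAY_MS) is true, so a site asking for a 10 minute
delay is dropped regardless of the Nutch setting. Passing parserCapFor(900)
instead lets the same site through, which is the effect the issue asks for.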



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
