[ https://issues.apache.org/jira/browse/NUTCH-2754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002194#comment-17002194 ]

ASF GitHub Bot commented on NUTCH-2754:
---------------------------------------

sebastian-nagel commented on pull request #487: NUTCH-2754 
fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
URL: https://github.com/apache/nutch/pull/487
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> fetcher.max.crawl.delay ignored if exceeding 5 min. / 300 sec.
> --------------------------------------------------------------
>
>                 Key: NUTCH-2754
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2754
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher, robots
>    Affects Versions: 1.16
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.17
>
>
> Sites specifying a Crawl-Delay of more than 5 minutes (301 seconds or more) 
> are always ignored, even if fetcher.max.crawl.delay is set to a higher value.
> We need to pass the value of fetcher.max.crawl.delay to 
> [crawler-commons' robots.txt 
> parser|https://github.com/crawler-commons/crawler-commons/blob/c9c0ac6eda91b13d534e69f6da3fd15065414fb0/src/main/java/crawlercommons/robots/SimpleRobotRulesParser.java#L78];
> otherwise it uses its internal default of 300 sec. and treats every site 
> specifying a longer Crawl-Delay in its robots.txt as disallowed.
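
A minimal sketch of what "pass the value to the parser" could look like on the
Nutch side, assuming the crawler-commons change exposes a constructor that
overrides the hard-coded 300 sec. ceiling (that constructor, and the agent
name "mybot", are illustrative assumptions, not the API at the linked commit):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;
import org.apache.hadoop.conf.Configuration;

public class MaxCrawlDelaySketch {

    public static BaseRobotRules parseRules(Configuration conf, String url,
                                            byte[] robotsTxtContent) {
        // Nutch configures fetcher.max.crawl.delay in seconds (default: 30),
        // while crawler-commons measures crawl delays in milliseconds.
        long maxCrawlDelayMs = conf.getInt("fetcher.max.crawl.delay", 30) * 1000L;

        // Hypothetical constructor: stands in for whatever the fix adds to
        // make the parser's internal 300000 ms MAX_CRAWL_DELAY configurable.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser(maxCrawlDelayMs);

        // parseContent() is existing crawler-commons API; "mybot" is a
        // placeholder for the configured agent name(s).
        return parser.parseContent(url, robotsTxtContent, "text/plain", "mybot");
    }
}

With the limit derived from the configuration, a robots.txt line such as
"Crawl-Delay: 600" is honored whenever fetcher.max.crawl.delay allows it,
instead of the parser falling back to its 300 sec. default and treating the
whole site as disallowed.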



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
