[ 
https://issues.apache.org/jira/browse/NUTCH-1067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088602#comment-13088602
 ] 

Julien Nioche commented on NUTCH-1067:
--------------------------------------

Looks good, but two comments:
- fetcher.throughput.threshold -> rename to 
'fetcher.throughput.threshold.pages'? That way we could also introduce a 
threshold based on bytes later.
- The threshold should be a float rather than an integer -> small crawls can 
run at less than one page per second, and we would still want to use the 
threshold to prevent things from getting worse.
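
To illustrate the second point, here is a minimal sketch (not code from the 
patch; the class and method names are hypothetical) of how an integer 
threshold compares with a float one when a small crawl runs below one page 
per second:

```java
public class ThresholdTypeDemo {
    // Hypothetical check: true when the measured rate falls below the threshold.
    static boolean belowThreshold(float pagesPerSecond, float threshold) {
        return pagesPerSecond < threshold;
    }

    public static void main(String[] args) {
        float measured = 0.5f; // small crawl: one page every two seconds

        // With an integer setting the smallest positive value is 1,
        // so this crawl would be flagged even though it is still progressing.
        int intThreshold = 1;
        System.out.println(belowThreshold(measured, intThreshold)); // true

        // A float setting can express "give up below 0.2 pages per second".
        float floatThreshold = 0.2f;
        System.out.println(belowThreshold(measured, floatThreshold)); // false
    }
}
```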

Out of curiosity, why did you put hasMore() in a separate method?

Thanks

Ju

> Configure minimum throughput for fetcher
> ----------------------------------------
>
>                 Key: NUTCH-1067
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>            Reporter: Markus Jelsma
>            Assignee: Julien Nioche
>            Priority: Minor
>             Fix For: 1.4, 2.0
>
>         Attachments: NUTCH-1067-1.4-1.patch, NUTCH-1067-1.4-2.patch, 
> NUTCH-1067-1.4-3.patch
>
>
> Large fetches can contain a lot of URLs for the same domain. These can be 
> very slow to crawl due to politeness settings from robots.txt, e.g. 10s per 
> URL. Once all other URLs have been fetched, these queues can stall the 
> entire fetcher: 60 URLs can then take 10 minutes or more. This can usually 
> be dealt with using the time bomb, but the time bomb value is hard to 
> determine.
> This patch adds a fetcher.throughput.threshold setting: the minimum number 
> of pages per second below which the fetcher gives up. It doesn't use the 
> global number of pages / running time but records the actual pages processed 
> in the previous second. This value is compared with the configured threshold.
> Besides the check, the fetcher's status is also updated with the actual 
> number of pages per second and bytes per second.
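
The per-second check described above could be sketched roughly as follows. 
This is an illustration under assumptions, not the code in the attached 
patches: the class ThroughputMonitor and its methods are hypothetical, and 
the caller is assumed to invoke tickAndCheck() once per second.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ThroughputMonitor {
    // Hypothetical holder for the configured minimum pages per second.
    private final float thresholdPagesPerSec;
    // Pages processed since the last tick; updated by fetcher threads.
    private final AtomicInteger pagesThisSecond = new AtomicInteger();

    public ThroughputMonitor(float thresholdPagesPerSec) {
        this.thresholdPagesPerSec = thresholdPagesPerSec;
    }

    // Called each time a page has been processed.
    public void pageProcessed() {
        pagesThisSecond.incrementAndGet();
    }

    // Assumed to be called once per second: reads and resets the counter,
    // then reports whether the previous second fell below the threshold.
    public boolean tickAndCheck() {
        int pagesLastSecond = pagesThisSecond.getAndSet(0);
        return pagesLastSecond < thresholdPagesPerSec;
    }
}
```

Because the counter is reset on every tick, the comparison uses only the 
pages processed in the previous second rather than the global average over 
the whole run.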

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
