[jira] [Created] (NUTCH-1067) Configure minimum throughput for fetcher

Markus Jelsma (JIRA) Fri, 22 Jul 2011 07:33:20 -0700

Configure minimum throughput for fetcher
----------------------------------------


                 Key: NUTCH-1067
                 URL: https://issues.apache.org/jira/browse/NUTCH-1067
             Project: Nutch
          Issue Type: New Feature
          Components: generator
            Reporter: Markus Jelsma
            Assignee: Markus Jelsma
            Priority: Minor
             Fix For: 1.4, 2.0
         Attachments: NUTCH-1067-1.4-1.patch

Large fetches can contain many URLs for the same domain. These can be very 
slow to crawl because of politeness delays from robots.txt, e.g. 10s per URL. 
Once all other URLs have been fetched, these queues can stall the entire 
fetcher: 60 URLs at 10s each take 10 minutes or even more. This can usually be 
dealt with using the time bomb, but the time bomb value is hard to determine.

This patch adds a fetcher.throughput.threshold setting: the minimum number of 
pages per second below which the fetcher gives up. It doesn't use the global 
average (pages / running time) but records the actual number of pages 
processed in the previous second. This value is compared with the configured 
threshold.
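
The check described above can be sketched roughly as follows. This is only an 
illustration and not the code from the attached patch: the class and method 
names (ThroughputMonitor, recordPage, sampleLastSecond, belowThreshold) are 
hypothetical, and treating a threshold of zero or less as "check disabled" is 
an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of a per-second throughput check; not the patch itself. */
public class ThroughputMonitor {
    // Total pages processed so far, incremented by the fetcher threads.
    private final AtomicLong totalPages = new AtomicLong();
    // Value of totalPages at the previous one-second sample.
    private long pagesAtLastSample = 0;
    // Configured minimum pages per second (stands in for fetcher.throughput.threshold).
    private final int threshold;

    public ThroughputMonitor(int threshold) {
        this.threshold = threshold;
    }

    /** Called whenever a page has been processed. */
    public void recordPage() {
        totalPages.incrementAndGet();
    }

    /** Called once per second; returns the pages processed since the previous call. */
    public long sampleLastSecond() {
        long now = totalPages.get();
        long delta = now - pagesAtLastSample;
        pagesAtLastSample = now;
        return delta;
    }

    /** True when the last second's throughput is under the threshold (assumed disabled when <= 0). */
    public boolean belowThreshold(long pagesLastSecond) {
        return threshold > 0 && pagesLastSecond < threshold;
    }
}
```

Sampling the difference between two consecutive totals, rather than dividing 
the total by the running time, means a fast start cannot mask a slow tail.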

Besides the check, the fetcher's status is also updated with the actual number 
of pages per second and bytes per second.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
