[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972984#comment-13972984 ]
Julien Nioche commented on NUTCH-207:
-------------------------------------
I am starting to think that the cleanest way to implement this would be to
make some radical changes to the way the Fetcher works and use the Executor
framework. The ThreadPoolExecutor is quite a nice fit for that, as it defines a
maximum number of threads to use, but it would require changing the logic in
the Fetcher so that the queues push tasks to the Executor instead of having
the FetcherThreads poll them for work. I will probably open a new issue for
this.
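
To make the push model a bit more concrete, here is a minimal sketch of queues
handing tasks to a ThreadPoolExecutor rather than threads polling for work. It
is not taken from the Nutch code or from ratelimit.patch: the class
PushFetchSketch, the methods fetch() and fetchAll(), and the example URLs are
all hypothetical, and the "fetch" only increments a counter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PushFetchSketch {

    // Hypothetical stand-in for fetching one URL; a real fetcher would do
    // network I/O here. This sketch only counts the call.
    static void fetch(String url, AtomicInteger fetched) {
        fetched.incrementAndGet();
    }

    // Pushes one task per URL to the executor and returns how many tasks ran.
    static int fetchAll(List<String> urls, int maxThreads) {
        AtomicInteger fetched = new AtomicInteger();

        // With an unbounded work queue the pool never grows beyond
        // corePoolSize, so core and maximum are set to the same value.
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
                maxThreads, maxThreads, 30L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());

        // Push model: the producer submits work; no thread polls a queue itself.
        for (String url : urls) {
            executor.execute(() -> fetch(url, fetched));
        }

        // Stop accepting new tasks and wait for the submitted ones to finish.
        executor.shutdown();
        try {
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return fetched.get();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.org/a", "http://example.org/b", "http://example.org/c");
        System.out.println("fetched " + fetchAll(urls, 2) + " URLs");
    }
}
```

ThreadPoolExecutor also allows its pool size to be changed at runtime through
setCorePoolSize() and setMaximumPoolSize(), which might be one way to steer the
thread count towards a bandwidth target. How that would fit into the Fetcher is
left open here.
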
> Bandwidth target for fetcher rather than a thread count
> -------------------------------------------------------
>
> Key: NUTCH-207
> URL: https://issues.apache.org/jira/browse/NUTCH-207
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher
> Affects Versions: 0.8
> Reporter: Rod Taylor
> Assignee: Julien Nioche
> Fix For: 1.9
>
> Attachments: ratelimit.patch
>
>
> Increases or decreases the number of threads from the starting value
> (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve
> a target bandwidth (fetcher.threads.bandwidth).
> It seems to be able to keep within 10% of the target bandwidth even when
> large numbers of errors are found or when a number of large pages are
> encountered.
> To achieve more accurate tracking, Nutch should keep track of protocol
> overhead as well as the volume of pages downloaded.
--
This message was sent by Atlassian JIRA
(v6.2#6252)