[ https://issues.apache.org/jira/browse/NUTCH-207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13988655#comment-13988655 ]
Sebastian Nagel commented on NUTCH-207:
---------------------------------------

Looks good.
- there are some collisions/overlaps with NUTCH-1182
- NPE if fetcher.bandwidth.target.check.everyNSecs is set to 0
- conditions which disable the feature could be placed on top, e.g.:
{code}
if (targetBandwidth == -1 || maxNumThreads == -1) {
  // disabled
} else if (bandwidthTargetCheckCounter < bandwidthTargetCheckEveryNSecs) {
  bandwidthTargetCheckCounter++;
} ...
{code}
- in case "fetcher.maxNum.threads" is not set: it could be initialized to the number of threads. That would simplify configuration and code.
- log output should always use the same unit: "averageBdwPerThread : 36721680" vs. "Exceeding target bandwidth (110 vs 1 Mbps)"
- missing space in log output: "Thread FetcherThreadhas no more work available"
- wouldn't kbps be a more appropriate unit for configuration? Do we really need the granularity and precision of bps? It would also avoid int overflows in the near future with GBit networks.
- line 1290-1: "check whether it is worth doing e.g. more queues than threads": shouldn't "fetcher.threads.per.host" also be taken into account? Then the bandwidth could also be adjusted when only one (local) server is crawled but with heavy load.

> Bandwidth target for fetcher rather than a thread count
> -------------------------------------------------------
>
>                 Key: NUTCH-207
>                 URL: https://issues.apache.org/jira/browse/NUTCH-207
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.8
>            Reporter: Rod Taylor
>            Assignee: Julien Nioche
>             Fix For: 1.9
>
>         Attachments: NUTCH-207.trunk.patch, ratelimit.patch
>
>
> Increases or decreases the number of threads from the starting value
> (fetcher.threads.fetch) up to a maximum (fetcher.threads.maximum) to achieve
> a target bandwidth (fetcher.threads.bandwidth).
> It seems to be able to keep within 10% of the target bandwidth even when
> large numbers of errors are found or when a number of large pages is run
> across.
> To achieve more accurate tracking Nutch should keep track of protocol
> overhead as well as the volume of pages downloaded.
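
On the unit question raised in the review comment: a Java int tops out at 2,147,483,647, so a target expressed in bps cannot represent anything above roughly 2.1 Gbit/s, while the same rate in kbps fits easily. The sketch below only illustrates that arithmetic and one way to print rates in a single unit. The class and method names are hypothetical and not taken from the patch, and it treats the logged value 36721680 as bps purely for illustration, which the log line itself does not confirm.

```java
import java.util.Locale;

// Hypothetical helper, not part of the NUTCH-207 patch.
public class BandwidthUnits {

    /** Converts bits per second to kilobits per second (1 kbps = 1000 bps), rounding down. */
    public static long bpsToKbps(long bps) {
        return bps / 1000L;
    }

    /** Returns true if the value can be stored in a Java int without overflow. */
    public static boolean fitsInInt(long value) {
        return value >= Integer.MIN_VALUE && value <= Integer.MAX_VALUE;
    }

    /** Formats a rate given in bps as Mbps so that all log lines share one unit. */
    public static String formatMbps(long bps) {
        return String.format(Locale.ROOT, "%.2f Mbps", bps / 1_000_000.0);
    }

    public static void main(String[] args) {
        long tenGbitInBps = 10_000_000_000L;

        // 10 Gbit/s expressed in bps exceeds Integer.MAX_VALUE.
        System.out.println("bps fits in int:  " + fitsInInt(tenGbitInBps));

        // The same rate in kbps is 10,000,000 and fits comfortably.
        System.out.println("kbps fits in int: " + fitsInInt(bpsToKbps(tenGbitInBps)));

        // Assumption: the logged number is in bps.
        System.out.println(formatMbps(36_721_680L));
    }
}
```

Whether kbps or Mbps is the better configuration unit is a judgment call for the patch author; the point here is only that a coarser unit leaves headroom in an int-typed property.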