[ https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2652.
------------------------------------
    Resolution: Fixed
      Assignee: Sebastian Nagel

Merged into 1.x/master. The fix is already in use in production: the number
of fetcher tasks now equals the number of fetch lists.

> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>
>                 Key: NUTCH-2652
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2652
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>         Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH
> 5.15.1, Nutch built on recent master.
> Seen for the first time just now, although this setup has been running for
> two months with Nutch 1.15. The constraints that cause inputs to be split
> may change from run to run.
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.16
>
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> This violates a design principle of Nutch as a MapReduce-based crawler: to
> ensure politeness and a guaranteed delay between requests to the same
> host/domain/IP, the Generator puts all items of one host/domain/IP into the
> same fetch list. A fetch list must not be split, because that would violate
> the politeness constraints: multiple fetcher tasks processing the splits of
> one fetch list could then send requests to the same host/domain/IP in
> parallel. See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide
> (3rd edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
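
The politeness argument in the issue description can be illustrated with a small, self-contained sketch. It is not Nutch code: the class `FetchListSketch` and its methods are hypothetical names chosen for illustration, hosts are assigned to lists by a simple hash (an assumption, Nutch's Generator has its own partitioning), and URLs are assumed to be well-formed absolute URLs. The sketch shows that unsplit fetch lists never share a host between tasks, while splitting a single fetch list can put the same host into two task inputs. In Hadoop itself, `FileInputFormat` provides an `isSplitable` hook to keep a file in one split; whether the NUTCH-2652 patch relies on that hook is not stated in this message.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the fetch-list invariant; not part of Nutch. */
final class FetchListSketch {

  /** Assigns every URL to a fetch list by host, so one host never spans two lists. */
  static List<List<String>> partitionByHost(List<String> urls, int numLists) {
    List<List<String>> lists = new ArrayList<>();
    for (int i = 0; i < numLists; i++) {
      lists.add(new ArrayList<>());
    }
    for (String url : urls) {
      String host = URI.create(url).getHost(); // assumes an absolute URL with a host
      lists.get(Math.floorMod(host.hashCode(), numLists)).add(url);
    }
    return lists;
  }

  /** Cuts one fetch list into chunks, imitating an input format that splits its input. */
  static List<List<String>> split(List<String> fetchList, int chunkSize) {
    List<List<String>> splits = new ArrayList<>();
    for (int i = 0; i < fetchList.size(); i += chunkSize) {
      int end = Math.min(i + chunkSize, fetchList.size());
      splits.add(new ArrayList<>(fetchList.subList(i, end)));
    }
    return splits;
  }

  /** Returns hosts that occur in more than one task input and could be hit in parallel. */
  static Set<String> hostsSharedAcrossTasks(List<List<String>> taskInputs) {
    Map<String, Integer> firstTask = new HashMap<>();
    Set<String> shared = new HashSet<>();
    for (int task = 0; task < taskInputs.size(); task++) {
      for (String url : taskInputs.get(task)) {
        String host = URI.create(url).getHost();
        Integer previous = firstTask.putIfAbsent(host, task);
        if (previous != null && previous != task) {
          shared.add(host);
        }
      }
    }
    return shared;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://a.example/1", "http://a.example/2",
        "http://a.example/3", "http://b.example/1");

    // One task per unsplit fetch list: no host is shared between tasks.
    List<List<String>> fetchLists = partitionByHost(urls, 2);
    System.out.println("unsplit: " + hostsSharedAcrossTasks(fetchLists));

    // Splitting a single fetch list spreads one host over several tasks.
    List<List<String>> splits = split(urls.subList(0, 3), 2);
    System.out.println("split:   " + hostsSharedAcrossTasks(splits));
  }
}
```

With 128 fetch lists and 187 splits, as in the log excerpt, at least some fetch lists must have been cut into more than one split, which is the second case in `main`.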