Sebastian Nagel created NUTCH-2652:
--------------------------------------

             Summary: Fetcher launches more fetch tasks than fetch lists
                 Key: NUTCH-2652
                 URL: https://issues.apache.org/jira/browse/NUTCH-2652
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.15
         Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 
5.15.1, Nutch built on recent master.

Seen for the first time just now, although this setup has been running Nutch 
1.15 for two months. The constraints that cause inputs to be split may change 
from run to run, though.
            Reporter: Sebastian Nagel
             Fix For: 1.16


Fetcher may launch more fetcher tasks than there are fetch lists:
{noformat}
18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
{noformat}
This breaks a design principle of Nutch as a MapReduce-based crawler: to ensure 
politeness and a guaranteed delay between requests to the same host/domain/IP, 
the Generator puts all items of one host/domain/IP into the same fetch list. 
A fetch list must not be split, because that would violate the politeness 
constraints: multiple fetcher tasks processing the splits of one fetch list 
may then send requests to the same host/domain/IP in parallel. See [~ab]'s 
chapter about Nutch in [Hadoop: The Definitive Guide (3rd 
edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
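
The effect can be illustrated with a minimal, self-contained sketch. It is not Nutch code: the class name FetchListSketch, its methods, the hash-based assignment of hosts to lists, and the example URLs are all hypothetical and only model the principle described above. URLs are first grouped into fetch lists by host, then every list is cut in half the way a splitting input format might do it, and the sketch reports which hosts end up in more than one task.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FetchListSketch {

    /** Assign each URL to a fetch list by hashing its host (assumed scheme),
     *  so that all URLs of one host land in the same list. */
    public static List<List<String>> partitionByHost(List<String> urls, int numLists) {
        List<List<String>> lists = new ArrayList<>();
        for (int i = 0; i < numLists; i++) {
            lists.add(new ArrayList<>());
        }
        for (String url : urls) {
            String host = URI.create(url).getHost();
            lists.get(Math.floorMod(host.hashCode(), numLists)).add(url);
        }
        return lists;
    }

    /** Cut every non-trivial list into two halves, mimicking an input
     *  format that splits one fetch list into several task inputs. */
    public static List<List<String>> splitEachInHalf(List<List<String>> lists) {
        List<List<String>> tasks = new ArrayList<>();
        for (List<String> list : lists) {
            int mid = list.size() / 2;
            if (mid > 0) {
                tasks.add(new ArrayList<>(list.subList(0, mid)));
            }
            if (mid < list.size()) {
                tasks.add(new ArrayList<>(list.subList(mid, list.size())));
            }
        }
        return tasks;
    }

    /** Hosts that occur in the input of more than one task and could
     *  therefore be requested in parallel. */
    public static Set<String> hostsSharedAcrossTasks(List<List<String>> tasks) {
        Map<String, Integer> firstTask = new HashMap<>();
        Set<String> shared = new HashSet<>();
        for (int t = 0; t < tasks.size(); t++) {
            for (String url : tasks.get(t)) {
                String host = URI.create(url).getHost();
                Integer prev = firstTask.putIfAbsent(host, t);
                if (prev != null && prev != t) {
                    shared.add(host);
                }
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "http://a.example/2",
            "http://a.example/3", "http://b.example/1");
        List<List<String>> fetchLists = partitionByHost(urls, 2);
        System.out.println("unsplit: " + hostsSharedAcrossTasks(fetchLists));
        System.out.println("split:   " + hostsSharedAcrossTasks(splitEachInHalf(fetchLists)));
    }
}
```

With unsplit lists no host is shared between tasks; once the lists are split, a.example appears in two tasks. In Hadoop's MapReduce API the usual hook for declaring an input file unsplittable is FileInputFormat.isSplitable(JobContext, Path); whether overriding it is the appropriate fix for this issue is not stated in the report.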



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
