[jira] [Resolved] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists

Sebastian Nagel (JIRA) Sat, 20 Oct 2018 10:41:18 -0700


     [ https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Sebastian Nagel resolved NUTCH-2652.
------------------------------------
    Resolution: Fixed
      Assignee: Sebastian Nagel

Merged into 1.x/master. The fix is already running in production: the number 
of fetcher tasks now equals the number of fetch lists.

> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>
>                 Key: NUTCH-2652
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2652
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>         Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 
> 5.15.1, Nutch built on recent master.
> Observed for the first time now, although this setup has been running Nutch 
> 1.15 for two months. However, the constraints that cause inputs to be split 
> may change from run to run.
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.16
>
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> One design principle of Nutch as a MapReduce-based crawler is that, to 
> ensure politeness and a guaranteed delay between requests to the same 
> host/domain/IP, Generator puts all items of one host/domain/IP into the same 
> fetch list. A fetch list must not be split because that would violate the 
> politeness constraints: multiple fetcher tasks processing the splits of one 
> fetch list could then send requests to the same host/domain/IP in parallel. 
> See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide (3rd 
> edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
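
The politeness argument above can be illustrated with a small toy model. This 
is only a sketch: it uses no Hadoop or Nutch classes, and the class name 
FetchListSplitDemo, the methods split and hostsInMultipleTasks, and the host 
names are hypothetical. It assumes a fetch list can be modelled as a list of 
host names (one per URL) and that an input split is a contiguous chunk of it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only (hypothetical names): no Hadoop or Nutch classes are used. */
public class FetchListSplitDemo {

    /** Cuts one fetch list into contiguous chunks, roughly as input splits would. */
    static List<List<String>> split(List<String> fetchList, int parts) {
        List<List<String>> chunks = new ArrayList<>();
        int size = (fetchList.size() + parts - 1) / parts; // ceiling division
        for (int start = 0; start < fetchList.size(); start += size) {
            int end = Math.min(start + size, fetchList.size());
            chunks.add(new ArrayList<>(fetchList.subList(start, end)));
        }
        return chunks;
    }

    /** Returns the hosts that occur in more than one task, i.e. hosts that could be requested in parallel. */
    static Set<String> hostsInMultipleTasks(List<List<String>> tasks) {
        Map<String, Integer> firstTask = new HashMap<>();
        Set<String> shared = new TreeSet<>();
        for (int t = 0; t < tasks.size(); t++) {
            for (String host : tasks.get(t)) {
                // Remember the first task that saw this host.
                Integer seen = firstTask.putIfAbsent(host, t);
                if (seen != null && seen != t) {
                    shared.add(host);
                }
            }
        }
        return shared;
    }

    public static void main(String[] args) {
        // One fetch list: Generator has put all URLs of a host into it.
        List<String> fetchList =
            Arrays.asList("a.example", "a.example", "a.example", "b.example");

        // One task per fetch list: no host is shared between tasks.
        System.out.println(hostsInMultipleTasks(Collections.singletonList(fetchList)));

        // The same fetch list cut into two splits: a.example lands in both tasks.
        System.out.println(hostsInMultipleTasks(split(fetchList, 2)));
    }
}
```

In Hadoop itself, one common way to keep a file-based input from being split 
is to override FileInputFormat.isSplitable so that it returns false. Whether 
the fix merged for this issue takes exactly that route is not stated in this 
message.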



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
