[ 
https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650111#comment-16650111
 ] 

ASF GitHub Bot commented on NUTCH-2652:
---------------------------------------

sebastian-nagel opened a new pull request #394: NUTCH-2652 Fetcher launches 
more fetch tasks than fetch lists
URL: https://github.com/apache/nutch/pull/394
 
 
   - properly override the method [getSplits(JobContext context) of 
FileInputFormat](https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext))



> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>
>                 Key: NUTCH-2652
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2652
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>         Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 
> 5.15.1, Nutch built on recent master.
> Seen for the first time just now, although Nutch 1.15 has been running for 
> two months. However, the constraints that cause inputs to be split may 
> change from run to run.
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.16
>
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 
> 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> This violates a design principle of Nutch as a MapReduce-based crawler: to 
> ensure politeness and a guaranteed delay between requests to the same 
> host/domain/IP, the Generator puts all items of one host/domain/IP into the 
> same fetch list. A fetch list must not be split because that would violate 
> the politeness constraints: multiple fetcher tasks processing the splits of 
> one fetch list could then send requests to the same host/domain/IP in 
> parallel. See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide (3rd 
> edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
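
To illustrate how the number of splits can exceed the number of input paths, as in the log above, here is a minimal, self-contained Java sketch. It does not use the Hadoop API and is not the code from the pull request: the class FetchListSplits, its methods, the file sizes and the 128 MiB split size are all hypothetical and chosen for illustration only. The model is also simplified (it ignores details of how Hadoop actually computes split boundaries). It only contrasts size-based splitting with the "one split per fetch list" behaviour that the politeness constraint requires.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FetchListSplits {

    /** Hypothetical stand-in for an input split: a byte range of one fetch list file. */
    static final class Split {
        final String file;
        final long start;
        final long length;

        Split(String file, long start, long length) {
            this.file = file;
            this.start = start;
            this.length = length;
        }
    }

    /** Size-based splitting: a file longer than maxSplitSize is cut into several splits. */
    static List<Split> splitBySize(String file, long fileLength, long maxSplitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += maxSplitSize) {
            long length = Math.min(maxSplitSize, fileLength - start);
            splits.add(new Split(file, start, length));
        }
        return splits;
    }

    /** Politeness-preserving variant: exactly one split that covers the whole file. */
    static List<Split> splitPerFile(String file, long fileLength) {
        return Collections.singletonList(new Split(file, 0, fileLength));
    }

    public static void main(String[] args) {
        long maxSplitSize = 128L << 20;                   // assumed split size: 128 MiB
        long[] fetchListSizes = {100L << 20, 200L << 20, 300L << 20}; // made-up sizes

        int bySize = 0;
        int perFile = 0;
        for (int i = 0; i < fetchListSizes.length; i++) {
            String name = "fetchlist-" + i;               // hypothetical file name
            bySize += splitBySize(name, fetchListSizes[i], maxSplitSize).size();
            perFile += splitPerFile(name, fetchListSizes[i]).size();
        }

        System.out.println("fetch lists: " + fetchListSizes.length);
        System.out.println("splits when splitting by size: " + bySize);
        System.out.println("splits with one split per file: " + perFile);
    }
}
```

In this model every fetch list larger than the assumed split size yields more than one split, so two or more fetcher tasks would work on URLs of the same host/domain/IP at the same time. With one split per file, the number of fetcher tasks always equals the number of fetch lists.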


