[ 
https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16657930#comment-16657930
 ] 

ASF GitHub Bot commented on NUTCH-2652:
---------------------------------------

sebastian-nagel closed pull request #394: NUTCH-2652 Fetcher launches more fetch tasks than fetch lists
URL: https://github.com/apache/nutch/pull/394
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/java/org/apache/nutch/fetcher/Fetcher.java b/src/java/org/apache/nutch/fetcher/Fetcher.java
index f6584c560..fe9e71ecb 100644
--- a/src/java/org/apache/nutch/fetcher/Fetcher.java
+++ b/src/java/org/apache/nutch/fetcher/Fetcher.java
@@ -23,28 +23,24 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.HashMap;
-import java.util.Iterator;
 import java.util.LinkedList;
 import java.util.List;
 import java.util.Map;
 import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.atomic.AtomicLong;
 
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
-
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.Text;
+import org.apache.hadoop.mapreduce.InputSplit;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.JobContext;
 import org.apache.hadoop.mapreduce.Mapper;
-import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+import org.apache.hadoop.mapreduce.lib.input.FileSplit;
+import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
-import org.apache.hadoop.mapreduce.InputSplit;
-import org.apache.hadoop.mapred.FileSplit;
 import org.apache.hadoop.util.StringUtils;
 import org.apache.hadoop.util.Tool;
 import org.apache.hadoop.util.ToolRunner;
@@ -55,6 +51,8 @@
 import org.apache.nutch.util.NutchJob;
 import org.apache.nutch.util.NutchTool;
 import org.apache.nutch.util.TimingUtil;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
 
 /**
  * A queue-based fetcher.
@@ -105,19 +103,20 @@
   private static final Logger LOG = LoggerFactory
       .getLogger(MethodHandles.lookup().lookupClass());
 
-  public static class InputFormat extends
-  SequenceFileInputFormat<Text, CrawlDatum> {
-    /** Don't split inputs, to keep things polite. */
-    public InputSplit[] getSplits(JobContext job, int nSplits) throws IOException {
+  public static class InputFormat
+      extends SequenceFileInputFormat<Text, CrawlDatum> {
+    /**
+     * Don't split inputs, to keep things polite - a single fetch list must be
+     * processed in one fetcher task. Do not split a fetch list and assign the
+     * splits to multiple parallel tasks.
+     */
+    @Override
+    public List<InputSplit> getSplits(JobContext job) throws IOException {
       List<FileStatus> files = listStatus(job);
-      FileSplit[] splits = new FileSplit[files.size()];
-      Iterator<FileStatus> iterator= files.listIterator();
-      int index = 0;
-      while(iterator.hasNext()) {
-        index++;
-        FileStatus cur = iterator.next();
-        splits[index] = new FileSplit(cur.getPath(), 0, cur.getLen(),
-            (String[]) null);
+      List<InputSplit> splits = new ArrayList<>();
+      for (FileStatus cur : files) {
+        splits.add(
+            new FileSplit(cur.getPath(), 0, cur.getLen(), (String[]) null));
       }
       return splits;
     }
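
For reference, here is a minimal standalone sketch of the patched input format as it
reads once the hunks above are applied. The class name FetchListInputFormat and the
explicit import of org.apache.nutch.crawl.CrawlDatum are only for illustration; in
the patch itself this is the static inner class Fetcher.InputFormat. Note that the
removed method used the old (JobContext, int) array-returning signature, which does
not override the new-API getSplits(JobContext), so the @Override annotation added by
the patch is what guarantees the custom splitting is actually used:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;

/**
 * Sketch of the patched input format: one split per fetch list file, so a
 * fetch list is never divided between parallel fetcher tasks.
 */
public class FetchListInputFormat
    extends SequenceFileInputFormat<Text, CrawlDatum> {

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    // Emit exactly one FileSplit covering each whole input file, ignoring the
    // HDFS block size, so the number of map tasks equals the number of files.
    List<InputSplit> splits = new ArrayList<>();
    for (FileStatus cur : listStatus(job)) {
      splits.add(new FileSplit(cur.getPath(), 0, cur.getLen(), (String[]) null));
    }
    return splits;
  }
}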


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>
>                 Key: NUTCH-2652
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2652
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.15
>         Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 
> 5.15.1, Nutch built on recent master.
> Seen for the first time just now, although the crawl has been running with 
> Nutch 1.15 for two months. The constraints that cause inputs to be split may 
> change from run to run.
>            Reporter: Sebastian Nagel
>            Priority: Critical
>             Fix For: 1.16
>
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> That's one design principle of Nutch as a MapReduce-based crawler: to ensure 
> politeness and a guaranteed delay between requests to the same host/domain/IP, 
> the Generator puts all items belonging to one host/domain/IP into the same 
> fetch list. A fetch list must not be split, because that would violate the 
> politeness constraints: multiple fetcher tasks processing the splits of one 
> fetch list could then send requests to the same host/domain/IP in parallel. 
> See [~ab]'s chapter about Nutch in [Hadoop: The Definitive Guide (3rd 
> edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
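
The jump from 128 input files to 187 splits in the log excerpt above is the behaviour
of the default FileInputFormat split logic, which cuts any file larger than the
computed split size (normally the HDFS block size) into several FileSplits. Below is a
minimal sketch of that arithmetic, with made-up block and file sizes purely for
illustration (the real implementation also lets the last split grow up to 10% larger):

public class SplitArithmeticSketch {

  /** Same formula as Hadoop's FileInputFormat.computeSplitSize(). */
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L << 20;     // 128 MB HDFS block (illustrative)
    long minSize = 1L;               // effective default split.minsize
    long maxSize = Long.MAX_VALUE;   // effective default split.maxsize
    long fetchListLen = 300L << 20;  // a 300 MB fetch list (made up)

    long splitSize = computeSplitSize(blockSize, minSize, maxSize);
    long splits = (fetchListLen + splitSize - 1) / splitSize;

    // => 3 splits for this single fetch list; summed over 128 fetch lists of
    // mixed sizes, the total can easily exceed 128 (187 in the reported job).
    System.out.println("splits for one fetch list: " + splits);
  }
}

With the patch above, getSplits() returns exactly one split per fetch list, so the
number of fetcher tasks again matches the number of fetch lists.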



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
