[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhaoyunjiong updated MAPREDUCE-6024:
------------------------------------

    Attachment: MAPREDUCE-6024.4.patch

Upload a new patch, get MAX_FETCH_FAILURES_NOTIFICATIONS back and make it 
configurable. 
For hostFailures, in my patch, I already use it, below are related code:
{code}

       // If connect did not succeed, just mark all the maps as failed,
       // indirectly penalizing the host
+      scheduler.hostFailed(host.getHostName());
       for(TaskAttemptID left: remaining) {
         scheduler.copyFailed(left, host, false, connectExcpt);
       }

+       scheduler.hostFailed(host.getHostName());
         for(TaskAttemptID left: failedTasks) {
           scheduler.copyFailed(left, host, true, false);
         }

+  public synchronized void hostFailed(String hostname) {
+    if (hostFailures.containsKey(hostname)) {
+      IntWritable x = hostFailures.get(hostname);
+      x.set(x.get() + 1);
+    } else {
+      hostFailures.put(hostname, new IntWritable(1));
+    }
+  }

+ boolean hostFail = hostFailures.get(hostname).get() > getMaxHostFailures() ? 
true : false;

+    checkAndInformJobTracker(failures, mapId, readError, connectExcpt, 
hostFail);
{code}

> java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 
> hour
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6024
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, task
>            Reporter: zhaoyunjiong
>            Assignee: zhaoyunjiong
>            Priority: Critical
>         Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.2.patch, 
> MAPREDUCE-6024.3.patch, MAPREDUCE-6024.4.patch, MAPREDUCE-6024.patch
>
>
> 2014-08-04 21:09:42,356 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 2 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> 2014-08-04 21:09:42,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 180024ms
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 21:12:55,463 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> ...
> 2014-08-04 22:03:13,416 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 271081ms
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 22:07:13,449 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to