[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096281#comment-14096281
 ] 

Zhijie Shen commented on MAPREDUCE-6024:
----------------------------------------

[~zhaoyunjiong], thanks again for updating the patch, but I'm not sure the new 
change is going to work:

1. The following code is moved out of copyFailed, so the counter is no longer 
increased for every failed map on a host.
{code}
+    if (hostFailures.containsKey(hostname)) {
+      IntWritable x = hostFailures.get(hostname);
+      x.set(x.get() + 1);
+    } else {
+      hostFailures.put(hostname, new IntWritable(1));
+    }
{code}

2. They are not necessarily the same. See Fetcher#copyMapOutput for details.
bq. failedTasks.size is equal to the remaining.size
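To illustrate why the two sizes can differ, here's a toy sketch (class and map names are made up, not the actual Fetcher code): a single fetch attempt can fail on one map output and stop, so failedTasks records only that map while the rest stay in remaining.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only; the real logic is in Fetcher#copyMapOutput.
class CopyOutputSketch {
    static List<String> copyMapOutputs(Set<String> remaining) {
        List<String> failedTasks = new ArrayList<>();
        for (String mapId : remaining) {
            boolean ok = !mapId.equals("map2"); // pretend only map2's copy fails
            if (!ok) {
                failedTasks.add(mapId); // only the failing map is recorded
                break;                  // the untried maps stay in remaining
            }
        }
        return failedTasks;
    }
}
```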

I did more investigation:

1. failureCounts is the # of failures of one map, while hostFailures is the # 
of failures on one host, and the following invariant holds:
{code}
failureCounts(map1) + failureCounts(map2) + .. + failureCounts(mapk) = 
hostFailures
{code}
failureCounts needs to be increased on each failure, so it should be kept in 
copyFailed.
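A minimal sketch of what keeping both counters in copyFailed would look like (plain ints instead of Hadoop's IntWritable, and simplified names; the real code is in ShuffleSchedulerImpl):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only, not the actual ShuffleSchedulerImpl code.
class FailureTracker {
    private final Map<String, Integer> failureCounts = new HashMap<>(); // per-map failures
    private final Map<String, Integer> hostFailures = new HashMap<>();  // per-host failures

    // Both counters are bumped on every copy failure, which preserves the
    // invariant: the sum of failureCounts over a host's maps equals that
    // host's hostFailures entry.
    void copyFailed(String mapId, String hostname) {
        failureCounts.merge(mapId, 1, Integer::sum);
        hostFailures.merge(hostname, 1, Integer::sum);
    }

    int mapFailures(String mapId) { return failureCounts.getOrDefault(mapId, 0); }
    int hostFailuresFor(String host) { return hostFailures.getOrDefault(host, 0); }
}
```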

2. The following logic controls the limit on the failures of a particular map:
{code}
if (failures >= abortFailureLimit) {
  try {
    throw new IOException(failures + " failures downloading " + mapId);
  } catch (IOException ie) {
    reporter.reportException(ie);
  }
}
{code}
The logic problem around hostFailures is that we track its increases and 
decreases, but we never use it to decide whether to report the exception. I 
think we can do with hostFailures something similar to failureCounts: compare 
it against a limit on the failures of a particular host, and report the 
exception when it goes beyond that limit.
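Something along these lines (a hypothetical sketch mirroring the per-map abortFailureLimit pattern quoted above; the class, field, and method names here are illustrative, not the committed API):

```java
import java.io.IOException;

// Hypothetical sketch of a per-host failure limit check.
class HostFailureCheck {
    // Hypothetical per-host limit; in the real patch this would be configurable.
    private final int maxHostFailures;

    HostFailureCheck(int maxHostFailures) { this.maxHostFailures = maxHostFailures; }

    // Once a host accumulates too many failures, surface an exception to the
    // reporter instead of silently retrying against that host forever.
    void checkHost(String hostname, int hostFailures) throws IOException {
        if (hostFailures >= maxHostFailures) {
            throw new IOException(hostFailures + " failures fetching from " + hostname);
        }
    }
}
```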

W.r.t. the hostFailures issue above, my thought here is whether improvements 1 
& 2 are already enough for you to decrease the waiting time, because the 
existing abortFailureLimit should have a similar effect to what maxHostFailures 
is supposed to do.

In addition, I've thought more about MAX_ALLOWED_FETCH_FAILURES_FRACTION. Maybe 
we want to bring it back, but make it configurable as the original patch does. 
It's a tradeoff between response time and fault tolerance (e.g. in the 
1-reducer case, a user may still want a second chance when the fetcher fails 
under the default config). Let's keep the current choice, but leave it up to 
the user to tune it.
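For the fraction check itself, the shape would be roughly this (a sketch only; the threshold constant and method are illustrative, and the real config key would live in the job configuration, not a hard-coded default):

```java
// Hypothetical sketch of a configurable fetch-failure fraction.
class FetchFailureConfig {
    // Illustrative default; the actual value would come from configuration.
    static final float DEFAULT_ALLOWED_FETCH_FAILURES_FRACTION = 0.5f;

    // A map is reported as failed once its fetch failures exceed this fraction
    // of the total fetch attempts: a lower fraction fails fast (better response
    // time), a higher one tolerates more transient errors (better fault
    // tolerance), which is exactly the tradeoff left to the user to tune.
    static boolean tooManyFailures(int fetchFailures, int totalFetches, float fraction) {
        return totalFetches > 0 && ((float) fetchFailures / totalFetches) > fraction;
    }
}
```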

> java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 
> hour
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6024
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, task
>            Reporter: zhaoyunjiong
>            Assignee: zhaoyunjiong
>            Priority: Critical
>         Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.2.patch, 
> MAPREDUCE-6024.3.patch, MAPREDUCE-6024.patch
>
>
> 2014-08-04 21:09:42,356 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 2 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> 2014-08-04 21:09:42,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 180024ms
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 21:12:55,463 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> ...
> 2014-08-04 22:03:13,416 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 271081ms
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 22:07:13,449 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
