[
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14096281#comment-14096281
]
Zhijie Shen commented on MAPREDUCE-6024:
----------------------------------------
[~zhaoyunjiong], thanks again for updating the patch, but I'm not sure the new
change is going to work:
1. The following code has been moved out of copyFailed, so the counter is no
longer incremented for every failed map on a host.
{code}
+    if (hostFailures.containsKey(hostname)) {
+      IntWritable x = hostFailures.get(hostname);
+      x.set(x.get() + 1);
+    } else {
+      hostFailures.put(hostname, new IntWritable(1));
+    }
{code}
2. They do not need to be the same. See Fetcher#copyMapOutput for details.
bq. failedTasks.size is equal to the remaining.size:
I did more investigation:
1. failureCounts is the # of failures of one map, while hostFailures is the #
of failures on one host, and the following invariant holds:
{code}
failureCounts(map1) + failureCounts(map2) + ... + failureCounts(mapK) = hostFailures(host)
{code}
failureCounts needs to be incremented for each failure, so it should stay in
copyFailed.
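To make the invariant concrete, here is a minimal standalone sketch (plain Java maps instead of the actual ShuffleSchedulerImpl fields, with illustrative class and method names) of both counters being bumped together inside copyFailed, so that the per-map counts always sum to the per-host count:
{code}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch, not the real scheduler code: both counters are
// incremented on every fetch failure, so for the maps served by one host,
// sum(failureCounts) == hostFailures for that host.
class FailureCounters {
  private final Map<String, Integer> failureCounts = new HashMap<>(); // per map
  private final Map<String, Integer> hostFailures = new HashMap<>();  // per host

  void copyFailed(String mapId, String hostname) {
    failureCounts.merge(mapId, 1, Integer::sum);   // per-map failure count
    hostFailures.merge(hostname, 1, Integer::sum); // per-host failure count
  }

  int mapFailureCount(String mapId) {
    return failureCounts.getOrDefault(mapId, 0);
  }

  int hostFailureCount(String hostname) {
    return hostFailures.getOrDefault(hostname, 0);
  }
}
{code}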
2. The following logic controls the limit of the failures of a particular map.
{code}
if (failures >= abortFailureLimit) {
  try {
    throw new IOException(failures + " failures downloading " + mapId);
  } catch (IOException ie) {
    reporter.reportException(ie);
  }
}
{code}
The logical problem with hostFailures is that we track its increments and
decrements, but we never use it to decide whether to report the exception. I
think we can handle hostFailures the same way as failureCounts: compare it
against a per-host failure limit, and report the exception once it exceeds
that limit.
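A rough sketch of what such a host-level check could look like, mirroring the per-map abortFailureLimit block above (maxHostFailures, checkHostFailures, and the ExceptionReporter interface here are illustrative assumptions, not the actual scheduler API):
{code}
import java.io.IOException;

// Hypothetical sketch: apply hostFailures the same way failureCounts is used,
// reporting a fatal exception once a host exceeds its failure limit.
class HostFailureLimit {
  interface ExceptionReporter {
    void reportException(Throwable t);
  }

  static void checkHostFailures(String hostname, int hostFailureCount,
                                int maxHostFailures, ExceptionReporter reporter) {
    if (hostFailureCount >= maxHostFailures) {
      try {
        throw new IOException(hostFailureCount + " failures downloading from "
            + hostname);
      } catch (IOException ie) {
        // Same pattern as the per-map abortFailureLimit check.
        reporter.reportException(ie);
      }
    }
  }
}
{code}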
W.r.t. the hostFailures issue above, the question here is whether improvements
1 & 2 are already enough to decrease the waiting time, because the existing
abortFailureLimit should have a similar effect to what maxHostFailures is
supposed to do.
In addition, I've thought more about MAX_ALLOWED_FETCH_FAILURES_FRACTION.
Maybe we want to bring it back, but make it configurable as the original patch
does. It's a tradeoff between response time and fault tolerance (e.g., in the
single-reducer case, other users may still want a second chance when the
fetcher fails under the default config). Let's keep the current choice as the
default, but leave it up to the user to tune it.
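For illustration, here's a minimal sketch of such a configurable fraction (the property key and 0.5 default below are assumptions for the sketch, not an existing Hadoop config key, and plain java.util.Properties stands in for Hadoop's Configuration):
{code}
import java.util.Properties;

// Hypothetical sketch of a configurable fetch-failure fraction: a fetch
// failure only becomes fatal once failed fetches exceed the configured
// fraction of total fetch attempts.
class FetchFailureConfig {
  // Assumed key and default, illustrative only.
  static final String KEY = "mapreduce.reduce.shuffle.max-fetch-failures-fraction";
  static final float DEFAULT_FRACTION = 0.5f;

  static float maxAllowedFetchFailuresFraction(Properties conf) {
    return Float.parseFloat(conf.getProperty(KEY, Float.toString(DEFAULT_FRACTION)));
  }

  // Trades response time (lower fraction fails fast) against fault tolerance
  // (higher fraction gives the fetcher a second chance).
  static boolean tooManyFetchFailures(int failed, int total, float fraction) {
    return total > 0 && ((float) failed / total) > fraction;
  }
}
{code}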
> java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1
> hour
> ---------------------------------------------------------------------------------
>
> Key: MAPREDUCE-6024
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: mr-am, task
> Reporter: zhaoyunjiong
> Assignee: zhaoyunjiong
> Priority: Critical
> Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.2.patch,
> MAPREDUCE-6024.3.patch, MAPREDUCE-6024.patch
>
>
> 2014-08-04 21:09:42,356 WARN fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to
> fake.host.name:13562 with 2 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> 2014-08-04 21:09:42,360 INFO fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl:
> fake.host.name:13562 freed by fetcher#33 in 180024ms
> 2014-08-04 21:09:55,360 INFO fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 21:09:55,360 INFO fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 21:12:55,463 WARN fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> ...
> 2014-08-04 22:03:13,416 INFO fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl:
> fake.host.name:13562 freed by fetcher#33 in 271081ms
> 2014-08-04 22:04:13,417 INFO fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 22:04:13,417 INFO fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 22:07:13,449 WARN fetcher#33
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
--
This message was sent by Atlassian JIRA
(v6.2#6252)