[jira] [Commented] (MAPREDUCE-6024) java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 hour

Zhijie Shen (JIRA) Tue, 12 Aug 2014 00:19:25 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093835#comment-14093835
 ]


Zhijie Shen commented on MAPREDUCE-6024:
----------------------------------------

bq. 1. For MAX_FETCH_FAILURES_NOTIFICATIONS, if change to proportional to the 
number of reducers, it will be same as MAX_ALLOWED_FETCH_FAILURES_FRACTION, so 
I deleted it. I do believe 

Sounds good to me. Under the existing defaults, the only cases in which a 
failure would have been triggered before the patch but not after it are 
fetchFailures <= 2 and shufflingReduceTasks <= 3. Given the problem described 
in this jira, it makes sense to give fewer chances when there are fewer 
reducer tasks. And if users really want to give the fetcher enough chances, 
they can tune MAX_ALLOWED_FETCH_FAILURES_FRACTION, and even make it go beyond 
1.0.
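To make the fraction-only check concrete, here is a minimal sketch. It assumes the check compares fetchFailures / shufflingReduceTasks against MAX_ALLOWED_FETCH_FAILURES_FRACTION. The class name, method name, zero-reducer handling and the threshold values used below are hypothetical and are not taken from the patch.

```java
// Hypothetical sketch, not the code in the patch.
final class FetchFailureCheck {

    // Returns true when the share of shuffling reducers that reported a fetch
    // failure for a map output reaches the allowed fraction.
    static boolean isMapFaulty(int fetchFailures,
                               int shufflingReduceTasks,
                               float maxAllowedFraction) {
        // Assumption: with no reducers shuffling, treat the rate as 1.0.
        float failureRate = shufflingReduceTasks == 0
            ? 1.0f
            : (float) fetchFailures / shufflingReduceTasks;
        return failureRate >= maxAllowedFraction;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        System.out.println(isMapFaulty(2, 4, 0.5f));
        System.out.println(isMapFaulty(4, 4, 1.5f));
    }
}
```

With a fraction above 1.0, the check can only pass when the number of reported failures exceeds the number of shuffling reducers, which is how a larger value gives the fetcher more chances.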

bq. 4. Sometimes fetcher can get data successfully after retry from 
SocketTimeoutException, so I think let fetcher retry some times is OK.

Sounds reasonable. In addition, I linked back to the previous comments in 
[MAPREDUCE-4772|https://issues.apache.org/jira/browse/MAPREDUCE-4772?focusedCommentId=13492593&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13492593],
 which said that a connect exception is more severe than a timeout.

[~venkateshrin], do you have any further comments?

Some more comments:

1. Should the property names be hyphenated: maxfetchfailuresfraction -> 
max-fetch-failures-fraction, and maxhostfailures -> max-host-failures?
{code}
+  public static final String MAX_ALLOWED_FETCH_FAILURES_FRACTION = "mapreduce.reduce.shuffle.maxfetchfailuresfraction";
{code}
{code}
+  public static final String MAX_SHUFFLE_FETCH_HOST_FAILURES = "mapreduce.reduce.shuffle.maxhostfailures";
{code}

2. Is it necessary to multiply maxHostFailures by numMaps? copyFailed is 
invoked in a loop, once for each remaining/failed task, right?
{code}
+    //report failure if already retried maxHostFailures times
+    boolean hostFail = hostFailures.get(hostname).get() > this.maxHostFailures
+        * numMaps ? true : false;
{code}
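To illustrate the question, here is a minimal sketch of a per-host counter compared directly against the limit, without the numMaps factor. It assumes hostFailures maps a host name to an AtomicInteger, as the snippet above suggests. The HostFailureTracker class and the shape of its copyFailed method are hypothetical and are not the patch's code.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not the code in the patch.
final class HostFailureTracker {
    private final ConcurrentMap<String, AtomicInteger> hostFailures =
        new ConcurrentHashMap<>();
    private final int maxHostFailures;

    HostFailureTracker(int maxHostFailures) {
        this.maxHostFailures = maxHostFailures;
    }

    // Called once per failed map output; returns true once the host has
    // failed more than maxHostFailures times.
    boolean copyFailed(String hostname) {
        int count = hostFailures
            .computeIfAbsent(hostname, h -> new AtomicInteger())
            .incrementAndGet();
        return count > maxHostFailures;
    }
}
```

If copyFailed really is called once per remaining/failed task, then one failed connection with several map outputs already increments the counter several times, which is the point to settle before deciding whether the numMaps factor is needed.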

BTW, you may want to click "Submit Patch" to ask Jenkins to verify your patch.

> java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 
> hour
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6024
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, task
>            Reporter: zhaoyunjiong
>            Assignee: zhaoyunjiong
>            Priority: Critical
>         Attachments: MAPREDUCE-6024.1.patch, MAPREDUCE-6024.patch
>
>
> 2014-08-04 21:09:42,356 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 2 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> 2014-08-04 21:09:42,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 180024ms
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 21:12:55,463 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> ...
> 2014-08-04 22:03:13,416 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 271081ms
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 22:07:13,449 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)



--
This message was sent by Atlassian JIRA
(v6.2#6252)
