[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092453#comment-14092453
 ] 

Zhijie Shen commented on MAPREDUCE-6024:
----------------------------------------

[~zhaoyunjiong], thanks for your contribution.

bq. 1. default value for MAX_FETCH_FAILURES_NOTIFICATIONS is 3, if there is 
only one reduce, it will take very long time before fetchFailures reach the 
threshold.

Inspired by this point, IMHO, a better idea is make 
MAX_FETCH_FAILURES_NOTIFICATIONS proportional to the number of reducers, and 
allow users to configure it as well.

bq. 2. decrease the default value MAX_ALLOWED_FETCH_FAILURES_FRACTION from 0.5 
to 0.3, and make it configurable.

It's good to be configurable, but how shall we still keep 0.5 as the default, 
as the value has been used for a long time.

bq. 3. if fetcher failed to fetch data from a host 5 times(about 20 minutes), 
inform job tracker the failure.

Sounds reasonable as well, but should we make this value configurable?

Setting aside the three ways to shorten the stuck period, it think the more 
straightforward fix for this problem is to deal with SocketTimeoutException 
properly. Bellow is the code in ShuffleSchedulerImpl#copyFromHost:
{code}
    } catch (IOException ie) {
      boolean connectExcpt = ie instanceof ConnectException;
      ioErrs.increment(1);
      LOG.warn("Failed to connect to " + host + " with " + remaining.size() + 
               " map outputs", ie);
{code}
ConnectException is treated specially, and set connectExcpt = true, and will 
consequently result in a fetch failure. IMHO, SocketTimeoutException could be 
consider as the connection issue as well. If we do this, the reduce task can 
jump out of the stuck stage more quickly. How do you think?

> java.net.SocketTimeoutException in Fetcher caused jobs stuck for more than 1 
> hour
> ---------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6024
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6024
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, task
>            Reporter: zhaoyunjiong
>            Assignee: zhaoyunjiong
>            Priority: Critical
>         Attachments: MAPREDUCE-6024.patch
>
>
> 2014-08-04 21:09:42,356 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 2 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> 2014-08-04 21:09:42,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 180024ms
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 21:09:55,360 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 21:12:55,463 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)
> ...
> 2014-08-04 22:03:13,416 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: 
> fake.host.name:13562 freed by fetcher#33 in 271081ms
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: Assigning 
> fake.host.name:13562 with 3 to fetcher#33
> 2014-08-04 22:04:13,417 INFO fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.ShuffleSchedulerImpl: assigned 3 of 3 
> to fake.host.name:13562 to fetcher#33
> 2014-08-04 22:07:13,449 WARN fetcher#33 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher: Failed to connect to 
> fake.host.name:13562 with 3 map outputs
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:129)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:697)
> at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:640)
> at 
> sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1195)
> at 
> org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:289)
> at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:165)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to