[ https://issues.apache.org/jira/browse/HADOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12585546#action_12585546 ]

Runping Qi commented on HADOOP-3130:
------------------------------------



A lot of reducers failed with the following message:

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out. (repeated 8 times)

I see a lot of the following exceptions in the log:

2008-04-04 13:50:03,796 WARN org.apache.hadoop.mapred.ReduceTask: task_200804041304_0005_r_000000_2 copy failed: task_200804041304_0005_m_000181_0 from xxxx.com
2008-04-04 13:50:03,823 WARN org.apache.hadoop.mapred.ReduceTask: java.net.SocketTimeoutException: Read timed out
        at sun.reflect.GeneratedConstructorAccessor3.newInstance(Unknown Source)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at sun.net.www.protocol.http.HttpURLConnection$6.run(HttpURLConnection.java:1298)
        at java.security.AccessController.doPrivileged(Native Method)
        at sun.net.www.protocol.http.HttpURLConnection.getChainedException(HttpURLConnection.java:1292)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:948)
        at org.apache.hadoop.mapred.MapOutputLocation.getInputStream(MapOutputLocation.java:125)
        at org.apache.hadoop.mapred.MapOutputLocation.getFile(MapOutputLocation.java:165)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:815)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:764)
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:129)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:632)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:577)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1004)
        ... 4 more

Did you also change the read timeout?
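
For reference, here is a minimal sketch of the two HttpURLConnection timeouts in play, connect versus read; the exception above is the read one firing. The class name and the values are illustrative only, not the actual MapOutputCopier code.

// Illustrative sketch only; not the actual MapOutputCopier code.
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchTimeoutSketch {
    // Hypothetical values; the real settings would come from the job configuration.
    private static final int CONNECT_TIMEOUT_MS = 30 * 1000;
    private static final int READ_TIMEOUT_MS = 3 * 60 * 1000;

    static InputStream openMapOutput(URL mapOutputUrl) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) mapOutputUrl.openConnection();
        // Time allowed to establish the TCP connection to the task tracker.
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        // Time allowed to block on each read once connected; this is what
        // surfaces as "java.net.SocketTimeoutException: Read timed out" above.
        conn.setReadTimeout(READ_TIMEOUT_MS);
        return conn.getInputStream();
    }
}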

What is the value of MAX_FAILED_UNIQUE_FETCHES?
Should it be some percentage of the total number of maps?

Anyhow, we need to revisit the policy for failing a reducer during shuffling.
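
To make that concrete, here is a rough sketch of the kind of policy I have in mind: bail out only when fetches from more than some fraction of the total maps have failed, rather than at a fixed constant. All names and numbers below are made up for illustration; this is not the current ReduceTask logic.

// Rough sketch of a proportional bail-out policy; not the existing ReduceTask logic.
import java.util.HashSet;
import java.util.Set;

public class ShuffleBailoutSketch {
    // Hypothetical thresholds: only give up when fetches from more than 10% of
    // the maps, and at least 5 distinct maps, have failed.
    private static final double MAX_FAILED_FRACTION = 0.10;
    private static final int MIN_FAILED_MAPS = 5;

    private final int totalMaps;
    private final Set<String> failedUniqueMaps = new HashSet<String>();

    public ShuffleBailoutSketch(int totalMaps) {
        this.totalMaps = totalMaps;
    }

    // Record a failed fetch of one map's output and decide whether the reducer
    // should bail out, scaled to the job size instead of a fixed constant.
    public boolean recordFailureAndShouldBail(String mapTaskId) {
        failedUniqueMaps.add(mapTaskId);
        int failed = failedUniqueMaps.size();
        return failed >= MIN_FAILED_MAPS && failed > totalMaps * MAX_FAILED_FRACTION;
    }
}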


> Shuffling takes too long to get the last map output.
> ----------------------------------------------------
>
>                 Key: HADOOP-3130
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3130
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3130-v2.patch, HADOOP-3130.patch, shuffling.log
>
>
> I noticed that towards the end of shuffling, the map output fetcher of the 
> reducer backs off too aggressively.
> I have attached a fragment of one reduce log from my job.
> Notice that the last map output was not fetched for 2 minutes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
