[ https://issues.apache.org/jira/browse/HADOOP-3130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584135#action_12584135 ]

Devaraj Das commented on HADOOP-3130:
-------------------------------------

I think a smaller timeout makes sense from a utilization point of view: we free 
up the fetcher thread sooner, and it can potentially fetch successfully from 
some other host. This needs to be benchmarked. But it also means we need to keep 
an eye on the self-healing aspect - we kill reducers after they fail to fetch a 
certain number of times (and a connection-establishment failure currently counts 
as a fetch failure). We might end up killing reducers sooner than we do today. 
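To illustrate the trade-off (purely a sketch, not the actual shuffle code - the class and constants below, SimpleMapOutputFetcher, CONNECT_TIMEOUT_MS and MAX_FETCH_FAILURES, are made up for illustration): a fetcher with a smaller connect timeout gives up on a dead host sooner, but every failed connect also bumps the per-map failure count that feeds the kill decision.

{code:java}
// Illustrative sketch only - not the Hadoop ReduceTask/shuffle code.
// SimpleMapOutputFetcher, CONNECT_TIMEOUT_MS and MAX_FETCH_FAILURES are
// hypothetical names used to show the trade-off between a shorter connect
// timeout and hitting the fetch-failure threshold sooner.
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

public class SimpleMapOutputFetcher {
    // A smaller connect timeout frees this thread sooner when a host is dead,
    // so it can go fetch some other map output instead.
    private static final int CONNECT_TIMEOUT_MS = 10_000;
    // But each failed connect also counts toward this threshold.
    private static final int MAX_FETCH_FAILURES = 3;

    // Consecutive fetch failures, keyed by map output URL.
    private final Map<String, Integer> fetchFailures = new HashMap<>();

    /** Returns the map output bytes, or null if this attempt failed. */
    public byte[] fetch(URL mapOutputUrl) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) mapOutputUrl.openConnection();
            conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
            conn.setReadTimeout(CONNECT_TIMEOUT_MS);
            try (InputStream in = conn.getInputStream()) {
                byte[] data = in.readAllBytes();
                fetchFailures.remove(mapOutputUrl.toString()); // success resets the count
                return data;
            }
        } catch (IOException e) {
            // A connection-establishment failure counts like any other fetch
            // failure, so an aggressive timeout also accelerates the count.
            int failures = fetchFailures.merge(mapOutputUrl.toString(), 1, Integer::sum);
            if (failures >= MAX_FETCH_FAILURES) {
                reportFetchFailure(mapOutputUrl); // would trigger re-execution / reducer kill
            }
            return null;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    private void reportFetchFailure(URL mapOutputUrl) {
        System.err.println("Too many fetch failures for " + mapOutputUrl);
    }
}
{code}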
[For killing reducers, we should probably move to a model where we look at the 
global picture and use all available information before killing a reducer (i.e., 
move this logic entirely to the JobTracker). In the case of map output fetch 
failures, the JT can then decide whether or not to kill a reducer based on which 
map outputs the reducer is failing to fetch, whether those map nodes are healthy, 
etc.]
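
Roughly what I have in mind for that JT-side decision (again just a sketch under assumed names and thresholds - GlobalFetchFailureTracker, MAP_FAULT_THRESHOLD and REDUCER_FAULT_THRESHOLD are hypothetical): blame the map output when many reducers fail on it or its node is unhealthy, and only fault a reducer when it keeps failing against maps on healthy nodes.

{code:java}
// Illustrative sketch only - not JobTracker code. The class, enum and
// thresholds are hypothetical; the point is the shape of a global decision
// that uses all fetch-failure reports rather than a per-reducer counter.
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class GlobalFetchFailureTracker {
    private static final int MAP_FAULT_THRESHOLD = 3;     // distinct reducers failing on one map
    private static final int REDUCER_FAULT_THRESHOLD = 5; // healthy maps one reducer fails on

    // mapId -> reducers that reported a fetch failure for that map output
    private final Map<String, Set<String>> failuresByMap = new HashMap<>();
    // reducerId -> maps on healthy nodes that this reducer failed to fetch from
    private final Map<String, Set<String>> failuresByReducer = new HashMap<>();

    public enum Action { NONE, RERUN_MAP, KILL_REDUCER }

    public Action onFetchFailure(String reducerId, String mapId, boolean mapNodeHealthy) {
        failuresByMap.computeIfAbsent(mapId, k -> new HashSet<>()).add(reducerId);

        // Many reducers failing on the same map output, or an unhealthy map
        // node, points at the map side: re-run the map, don't penalize reducers.
        if (!mapNodeHealthy || failuresByMap.get(mapId).size() >= MAP_FAULT_THRESHOLD) {
            return Action.RERUN_MAP;
        }

        // Only failures against maps on healthy nodes count toward the reducer.
        Set<String> reducerFailures =
            failuresByReducer.computeIfAbsent(reducerId, k -> new HashSet<>());
        reducerFailures.add(mapId);
        if (reducerFailures.size() >= REDUCER_FAULT_THRESHOLD) {
            return Action.KILL_REDUCER;
        }
        return Action.NONE;
    }
}
{code}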

> Shuffling takes too long to get the last map output.
> ----------------------------------------------------
>
>                 Key: HADOOP-3130
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3130
>             Project: Hadoop Core
>          Issue Type: Bug
>            Reporter: Runping Qi
>         Attachments: HADOOP-3130.patch, shuffling.log
>
>
> I noticed that towards the end of shuffling, the map output fetcher of the 
> reducer backs off too aggressively.
> I have attached a fraction of one reduce log from my job.
> Note that the last map output was not fetched within 2 minutes.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
