[ 
https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542794
 ] 

Owen O'Malley commented on HADOOP-1984:
---------------------------------------

Runping, it doesn't get to 10 minutes until it has failed 5 times. And it can
easily take 10 minutes to clear a backlog off of a task tracker that is getting
slammed. I have certainly seen jobs that take longer than that to work off the
backlog. I still maintain that a simple exponential backoff is the right
approach, because there are a lot of things that could have caused the
slowdown.
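
For readers unfamiliar with the term, a capped exponential backoff can be sketched roughly as follows. This is only an illustration, not the code in HADOOP-1984.patch: the class name, the method name, the doubling factor, and every number used with it are assumptions made for the example.

```java
public class BackoffSketch {

    /**
     * Delay to wait before the next retry, given how many consecutive
     * failures have been seen so far. The delay doubles with each failure
     * and is clamped to capMs. All parameters are hypothetical knobs.
     */
    static long delayMs(long baseMs, long capMs, int failures) {
        if (baseMs <= 0 || capMs < baseMs || failures < 1) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseMs;
        // Double once per failure after the first, stopping at the cap
        // so the value can never overflow.
        for (int i = 1; i < failures && delay < capMs; i++) {
            delay = Math.min(delay * 2, capMs);
        }
        return delay;
    }

    public static void main(String[] args) {
        // Illustrative values only: 1 second base, 10 minute cap.
        for (int failures = 1; failures <= 12; failures++) {
            System.out.println(failures + " failure(s): wait "
                    + delayMs(1_000L, 600_000L, failures) + " ms");
        }
    }
}
```

The point of this shape is that the first few retries happen quickly, while a tracker that keeps failing is given progressively more time to work off its backlog, whatever the underlying cause of the slowdown.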

Devaraj, please don't change the failure notification policy in this same bug.
If it needs to be changed, it should be a different issue. Just changing the
default number of retries in this issue is OK, but I don't think we should
change the policy for *that* in this issue. Furthermore, if we do change the
policy, I'd argue for something much more direct and say that if a tracker is
blacklisted for a job, the number of retries should be cut in half or
something.
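
The "cut in half" idea could look something like the sketch below. It is purely hypothetical: the class, the method, and the boolean flag are invented for illustration and are not taken from the Hadoop source or from the attached patch.

```java
public class RetryBudgetSketch {

    /**
     * Number of retries to allow against a tracker for a given job.
     * If the tracker is already blacklisted for that job, the default
     * budget is halved (integer division, so odd values round down).
     */
    static int retriesFor(int defaultRetries, boolean blacklistedForJob) {
        if (defaultRetries < 0) {
            throw new IllegalArgumentException("defaultRetries must be >= 0");
        }
        return blacklistedForJob ? defaultRetries / 2 : defaultRetries;
    }

    public static void main(String[] args) {
        // Illustrative default of 10 retries; not a real Hadoop setting.
        System.out.println("healthy tracker: " + retriesFor(10, false));
        System.out.println("blacklisted tracker: " + retriesFor(10, true));
    }
}
```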

> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
>                 Key: HADOOP-1984
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1984
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.0
>            Reporter: Runping Qi
>            Assignee: Amar Kamat
>            Priority: Critical
>             Fix For: 0.16.0
>
>         Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers get stuck in the copy phase, progressing
> extremely slowly.
> The entire cluster seems to be doing nothing. This causes very long tails on
> otherwise well-tuned map/reduce jobs.

