[ https://issues.apache.org/jira/browse/HADOOP-1984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542794 ]
Owen O'Malley commented on HADOOP-1984:
---------------------------------------
Runping, it doesn't get to 10 minutes until it has failed 5 times. And it can
easily take 10 minutes to clear a backlog off of a task tracker that is getting
slammed. I have certainly seen jobs that take longer than that to work off the
backlog. I still maintain that a simple exponential backoff is the right
approach, because there are a lot of things that could have caused the
slowdown.
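
For reference, here is a minimal sketch of what I mean by a simple exponential
backoff. It is not the code in the attached patch. The class name, the method
name, and the base delay and cap used in main are all made up for illustration;
the real values would be whatever the patch or the configuration settles on.

```java
// Illustrative sketch only: names and numbers are hypothetical and are not
// taken from the Hadoop source or from HADOOP-1984.patch.
final class BackoffSketch {
    private BackoffSketch() {}

    /**
     * Returns the wait before the next attempt after `failures` consecutive
     * failures. The wait starts at baseDelayMs, doubles with every failure,
     * and never exceeds maxDelayMs.
     */
    static long delayMillis(int failures, long baseDelayMs, long maxDelayMs) {
        if (failures < 0 || baseDelayMs <= 0 || maxDelayMs < baseDelayMs) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long delay = baseDelayMs;
        for (int i = 0; i < failures && delay < maxDelayMs; i++) {
            // Compare against half the cap first so the doubling cannot overflow.
            delay = (delay > maxDelayMs / 2) ? maxDelayMs : delay * 2;
        }
        return delay;
    }

    public static void main(String[] args) {
        long baseMs = 1_000L;    // hypothetical base delay
        long capMs = 600_000L;   // hypothetical cap (10 minutes)
        for (int failures = 0; failures <= 10; failures++) {
            System.out.println(failures + " failures -> "
                + delayMillis(failures, baseMs, capMs) + " ms");
        }
    }
}
```

The point of the cap is that a tracker that stays slow is still retried at a
bounded interval, while the early retries stay cheap when the slowdown turns
out to be brief.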
Devaraj, please don't change the failure notification policy in this same bug.
If it needs to be changed, it should be a different issue. Just changing the
default number of retries in this issue is ok, but I don't think we should
change the policy for *that* in this issue. Furthermore, if we do change the
policy, I'd argue for something much more direct and say that if a tracker is
blacklisted for a job, the number of retries should be cut in half or
something.
> some reducer stuck at copy phase and progress extremely slowly
> --------------------------------------------------------------
>
> Key: HADOOP-1984
> URL: https://issues.apache.org/jira/browse/HADOOP-1984
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Reporter: Runping Qi
> Assignee: Amar Kamat
> Priority: Critical
> Fix For: 0.16.0
>
> Attachments: HADOOP-1984.patch
>
>
> In many cases, some reducers got stuck at the copy phase, progressing
> extremely slowly.
> The entire cluster seems to be doing nothing. This causes very long tails in
> otherwise well-tuned map/red jobs.