[
https://issues.apache.org/jira/browse/MAPREDUCE-4775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492458#comment-13492458
]
Robert Joseph Evans commented on MAPREDUCE-4775:
------------------------------------------------
OK so I missed some of the code in shuffleScheduler.checkReducerHealth(). The
stall check is in there, but the previous check for a single map attempt is
completely useless at this point. Dropping the severity accordingly.
I am also confused about why a reducer could be stalled for over an hour
(MAPREDUCE-4772) and not be killed. I will look into that here too.
> Reducer will "never" commit suicide
> -----------------------------------
>
> Key: MAPREDUCE-4775
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4775
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Reporter: Robert Joseph Evans
> Assignee: Robert Joseph Evans
> Priority: Critical
>
> In 1.0 there are a number of conditions that will cause a reducer to commit
> suicide and exit.
> These include the reducer being stalled and the error percentage of total
> fetches being too high. In the new code it will only commit suicide when the
> total number of failures for a single task attempt is >= max(30, totalMaps/10).
> In the best case, with the quadratic back-off, it would take 20.5 hours for a
> single map attempt to reach 30 failures. And unless there is only one reducer
> running, the map task would have been restarted before then.
> We should go back to including the same reducer suicide checks that are in 1.0.
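
For reference, a minimal sketch of the checks discussed in the description. Only the max(30, totalMaps/10) limit comes from the issue text. The class name, method names, the 0.5 cutoffs, and the way the checks are OR-ed together are illustrative assumptions, not the actual Hadoop 1.0 or mrv2 code.

```java
// Hypothetical sketch; not the real ShuffleScheduler implementation.
final class ReducerHealthSketch {

    // Illustrative cutoffs (assumptions, not taken from the issue).
    static final double MAX_FAILED_FETCH_FRACTION = 0.5;
    static final double MAX_STALL_FRACTION = 0.5;

    // From the issue text: the new code gives up only once a single
    // task attempt has accumulated at least max(30, totalMaps/10) failures.
    static int perAttemptFailureLimit(int totalMaps) {
        return Math.max(30, totalMaps / 10);
    }

    // "Error percentage of total fetches is too high" (cutoff is assumed).
    static boolean errorRateTooHigh(long failedFetches, long successfulFetches) {
        long total = failedFetches + successfulFetches;
        if (total == 0) {
            return false; // nothing fetched yet, so no error rate to judge
        }
        return (double) failedFetches / total >= MAX_FAILED_FETCH_FRACTION;
    }

    // "Stalled": time without progress is large relative to the time the
    // shuffle spent making progress (ratio and cutoff are assumed).
    static boolean stalled(long nowMs, long lastProgressMs, long shuffleStartMs) {
        long stallMs = nowMs - lastProgressMs;
        long progressMs = Math.max(1, lastProgressMs - shuffleStartMs);
        return (double) stallMs / progressMs >= MAX_STALL_FRACTION;
    }

    // Assumed combination: any one condition is enough to give up.
    static boolean shouldGiveUp(int maxFailuresForOneAttempt, int totalMaps,
                                boolean errorRateTooHigh, boolean stalled) {
        return maxFailuresForOneAttempt >= perAttemptFailureLimit(totalMaps)
                || errorRateTooHigh
                || stalled;
    }

    public static void main(String[] args) {
        System.out.println(perAttemptFailureLimit(100));
        System.out.println(perAttemptFailureLimit(1000));
    }
}
```

With 100 maps the per-attempt limit stays at the floor of 30, which is why the description focuses on how long a single map attempt needs to reach 30 failures. The limit only grows past 30 once a job has more than 300 maps.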