[ https://issues.apache.org/jira/browse/MAPREDUCE-4013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230200#comment-13230200 ]
Bhallamudi Venkata Siva Kamesh commented on MAPREDUCE-4013:
-----------------------------------------------------------
I *think* the following could be the reason.
We initialize remainingMaps to totalMaps. But if
*mapreduce.map.failures.maxpercent* is set to a non-zero value, the job
proceeds even if some maps fail (up to the configured percentage). However,
remainingMaps is decremented only when a map output copy succeeds. The output
of a failed map is never copied, so even a single failed map leaves
remainingMaps non-zero forever, and the check below never fires.
{code:title=ShuffleScheduler.java|borderStyle=solid}
if (--remainingMaps == 0) {
  notifyAll();
}
{code}
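To make the suspected behaviour concrete, here is a minimal, self-contained sketch. It is not the real ShuffleScheduler: the class RemainingMapsSketch, the nested Counter, and the helper remainingAfter are hypothetical names made up for illustration. It only models a counter that starts at totalMaps and is decremented once per successful copy.

```java
public class RemainingMapsSketch {

    /** Hypothetical stand-in for the scheduler's bookkeeping (assumption, not Hadoop code). */
    static final class Counter {
        private int remainingMaps;

        Counter(int totalMaps) {
            // Mirrors the behaviour described above: start from the total map count.
            this.remainingMaps = totalMaps;
        }

        /** Called once per successfully copied map output. */
        synchronized void copySucceeded() {
            if (--remainingMaps == 0) {
                notifyAll(); // would wake a thread waiting for the shuffle to finish
            }
        }

        synchronized int remaining() {
            return remainingMaps;
        }
    }

    /** Simulates a shuffle in which failedMaps outputs are never copied. */
    static int remainingAfter(int totalMaps, int failedMaps) {
        Counter counter = new Counter(totalMaps);
        for (int i = 0; i < totalMaps - failedMaps; i++) {
            counter.copySucceeded();
        }
        return counter.remaining();
    }

    public static void main(String[] args) {
        System.out.println(remainingAfter(10, 0)); // prints 0: the waiter is notified
        System.out.println(remainingAfter(10, 1)); // prints 1: zero is never reached
    }
}
```

With one tolerated map failure the counter stops at 1, so in this model a thread waiting for it to reach zero would wait indefinitely, which matches the stuck reduce task described in the issue.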
> Reduce task gets stuck when a M/R job is configured to tolerate failures
> ------------------------------------------------------------------------
>
> Key: MAPREDUCE-4013
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-4013
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: mrv2
> Affects Versions: 0.23.2
> Reporter: Amar Kamat
> Priority: Blocker
> Labels: shuffle
> Fix For: 0.24.0
>
>
> When a M/R job is configured to run with some tolerance to task failures (via
> mapreduce.map.failures.maxpercent), then the reduce task of that job gets
> stuck in the shuffle phase.