[ https://issues.apache.org/jira/browse/MAPREDUCE-7349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gaoyu updated MAPREDUCE-7349:
-----------------------------

Related cluster configuration:
 * MAX_FETCH_FAILURES_NOTIFICATIONS is 3
 * NodeManager recovery is disabled

Bug scenario:
 # submit a wordcount job that contains 2 simple map tasks ({{map_0}} and {{map_1}}) and 1 simple reduce task ({{reduce_0}});
 # all map tasks finish successfully and the AppMaster is notified;
 # the NodeManager that ran the map task {{map_1}} crashes;
 # the AppMaster schedules a reduce attempt;
 # the reduce attempt sends a {{statusUpdate}} message to the AppMaster to report a fetch failure;
 # the reduce attempt fails with {{Shuffle$ShuffleError}}, caused by {{java.io.IOException: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out}};
 # the reduce attempt sends a {{fatalError}} message to the AppMaster;
 # the AppMaster successively reschedules another three reduce attempts, but all of them fail with {{Shuffle$ShuffleError}};
 # the AppMaster fails the wordcount job because of the failed reduce task;
 # the AppMaster receives three {{statusUpdate}} messages that report a fetch failure, like the message in step 5, but it has already failed the job and does not rerun the task {{map_1}}.

> An unexpected node crash and delayed messages would fail the job
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-7349
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7349
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster
>    Affects Versions: 3.2.2
>            Reporter: gaoyu
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org
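
The ordering problem in the bug scenario can be illustrated with a small, self-contained simulation. This is only a toy model and not Hadoop's actual AppMaster code: the class {{FetchFailureRaceSketch}}, its enums, and its methods are hypothetical names made up for this sketch. The threshold of 3 fetch-failure notifications comes from the report; the limit of 4 reduce attempts is an assumption inferred from the one initial attempt plus the three rescheduled attempts in steps 4 to 8. The sketch only shows that the outcome depends on whether the fetch-failure reports reach the AppMaster before or after the {{fatalError}} messages.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the message race described in the report (hypothetical names,
// not Hadoop classes).
final class FetchFailureRaceSketch {
    // From the report: MAX_FETCH_FAILURES_NOTIFICATIONS is 3.
    static final int MAX_FETCH_FAILURES_NOTIFICATIONS = 3;
    // Assumption: 1 initial reduce attempt + 3 rescheduled attempts.
    static final int MAX_REDUCE_ATTEMPTS = 4;

    enum Event { FETCH_FAILURE_REPORT, FATAL_ERROR }
    enum Outcome { MAP_RERUN, JOB_FAILED, UNDECIDED }

    // Processes messages in arrival order and returns the first decision reached.
    static Outcome process(List<Event> arrivalOrder) {
        int fetchFailureReports = 0;
        int failedReduceAttempts = 0;
        for (Event e : arrivalOrder) {
            if (e == Event.FETCH_FAILURE_REPORT) {
                fetchFailureReports++;
                if (fetchFailureReports >= MAX_FETCH_FAILURES_NOTIFICATIONS) {
                    return Outcome.MAP_RERUN;   // enough reports: rerun map_1
                }
            } else {
                failedReduceAttempts++;
                if (failedReduceAttempts >= MAX_REDUCE_ATTEMPTS) {
                    return Outcome.JOB_FAILED;  // reduce task exhausted its attempts
                }
            }
        }
        return Outcome.UNDECIDED;
    }

    // Every attempt's report arrives before that attempt's fatalError.
    static List<Event> inOrder() {
        List<Event> events = new ArrayList<>();
        for (int attempt = 0; attempt < MAX_REDUCE_ATTEMPTS; attempt++) {
            events.add(Event.FETCH_FAILURE_REPORT);
            events.add(Event.FATAL_ERROR);
        }
        return events;
    }

    // As in the report: only the first report is on time; the other three
    // arrive after all four fatalError messages.
    static List<Event> delayed() {
        List<Event> events = new ArrayList<>();
        events.add(Event.FETCH_FAILURE_REPORT);
        for (int attempt = 0; attempt < MAX_REDUCE_ATTEMPTS; attempt++) {
            events.add(Event.FATAL_ERROR);
        }
        for (int i = 0; i < 3; i++) {
            events.add(Event.FETCH_FAILURE_REPORT);
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println("in order: " + process(inOrder()));  // in order: MAP_RERUN
        System.out.println("delayed:  " + process(delayed()));  // delayed:  JOB_FAILED
    }
}
```

With the in-order sequence, the third report arrives before the fourth {{fatalError}}, so the model decides to rerun {{map_1}}. With the delayed sequence from the report, the attempt limit is reached first and the job fails, even though the same three reports arrive shortly afterwards.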