Jason Lowe created MAPREDUCE-4999:
-------------------------------------
Summary: AM attempt ended up in ERROR state and generated history after node decommissioned
Key: MAPREDUCE-4999
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4999
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 0.23.6
Reporter: Jason Lowe
Saw a case where a job recorded history for an app attempt that ended up in the
ERROR state after the node the AM was running on was decommissioned. When the
node was decommissioned, the RM marked all the containers on the node as killed
and subsequently invalidated the application attempt. The AM attempt heartbeated
in before the NM did (and therefore before the NM could kill the AM), discovered
that it was no longer a valid app attempt, and exited in the ERROR state.
However, it also thought, incorrectly, that it was the last attempt and
generated the history for the job.
Decommissioning a node should not cause an app attempt to end up in the ERROR
state with history, as the subsequent app attempt should be the one to generate
the definitive history for the job.
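
The expected behavior could be sketched roughly as below. This is only an
illustration of the decision described above, not the actual MR AM code: the
class, enum, and method names are all hypothetical, and the rule used for the
INTERNAL_ERROR case (last attempt only when the attempt limit is reached) is an
assumption. It also assumes the RM starts a subsequent attempt after
invalidating this one.

```java
// Hypothetical sketch: none of these names come from the Hadoop code base.
final class AttemptShutdownSketch {

    /** Why the AM attempt is shutting down (illustrative values only). */
    enum ShutdownCause {
        JOB_FINISHED,       // the job reached a terminal state on its own
        INTERNAL_ERROR,     // the AM hit an unrecoverable error
        INVALIDATED_BY_RM   // the RM no longer recognizes this attempt,
                            // e.g. because its node was decommissioned
    }

    /**
     * Decide whether this attempt should treat itself as the last one.
     * An attempt invalidated by the RM is never the last: a subsequent
     * attempt is expected to run and finish the job.
     */
    static boolean isLastAttempt(ShutdownCause cause, int attemptNumber, int maxAttempts) {
        switch (cause) {
            case INVALIDATED_BY_RM:
                return false;
            case JOB_FINISHED:
                return true;
            default:
                // Assumed rule: an erroring attempt is last only if no retries remain.
                return attemptNumber >= maxAttempts;
        }
    }

    /** Only the last attempt should write the definitive job history. */
    static boolean shouldGenerateHistory(ShutdownCause cause, int attemptNumber, int maxAttempts) {
        return isLastAttempt(cause, attemptNumber, maxAttempts);
    }

    public static void main(String[] args) {
        // The scenario from this report: first attempt, invalidated after decommission.
        System.out.println(shouldGenerateHistory(ShutdownCause.INVALIDATED_BY_RM, 1, 2));
    }
}
```

With a check like this, the attempt in the reported scenario would still exit,
but it would leave history generation to the subsequent attempt.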