Jason Lowe created MAPREDUCE-4819:
-------------------------------------
Summary: AM can rerun job after reporting final job status to the client
Key: MAPREDUCE-4819
URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
Project: Hadoop Map/Reduce
Issue Type: Bug
Components: mr-am
Affects Versions: 2.0.1-alpha, 0.23.3
Reporter: Jason Lowe
Priority: Critical
If the AM reports final job status to the client but then crashes before
unregistering with the RM, the RM can launch another AM attempt. Currently, AM
re-attempts assume that no previous attempt reached a final job state, so the
job is rerun (from scratch, if the output format doesn't support recovery).
Re-running the job after we've already told the client its final status is bad
for two reasons. If the job failed, it's confusing at best, since the client
was already told the job failed but the subsequent attempt could succeed. If
the job succeeded, there could be data loss: a subsequent job launched by the
client may try to consume the job's output as input just as the re-attempt
starts removing output files in preparation for the output commit.
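
One possible way to guard against the sequence described above is to durably
record the final job state before the client is told about it, and to have
every new AM attempt check for that record before doing any work. The following
is only a minimal sketch of that idea, not the actual MR AM code or a proposed
patch. The class name FinalStateMarker, the marker file name FINAL_JOB_STATE,
and the use of a local directory in place of the job's staging directory are
all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Optional;

// Hypothetical sketch: none of these names exist in Hadoop.
final class FinalStateMarker {
    // Assumed marker file name inside the job's staging directory.
    private static final String MARKER_NAME = "FINAL_JOB_STATE";

    private FinalStateMarker() {}

    /** Durably records the final state. Call this BEFORE notifying the client. */
    static void record(Path stagingDir, String finalState) {
        try {
            Path tmp = stagingDir.resolve(MARKER_NAME + ".tmp");
            Files.write(tmp, finalState.getBytes(StandardCharsets.UTF_8));
            // Rename atomically so a crash cannot leave a half-written marker.
            Files.move(tmp, stagingDir.resolve(MARKER_NAME),
                       StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the final state left by an earlier attempt, if there is one. */
    static Optional<String> previousFinalState(Path stagingDir) {
        Path marker = stagingDir.resolve(MARKER_NAME);
        if (!Files.exists(marker)) {
            return Optional.empty();
        }
        try {
            byte[] bytes = Files.readAllBytes(marker);
            return Optional.of(new String(bytes, StandardCharsets.UTF_8).trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** A new attempt should run the job only if no final state was recorded. */
    static boolean shouldRunJob(Path stagingDir) {
        return !previousFinalState(stagingDir).isPresent();
    }

    /** Helper for the demo: a temporary directory standing in for staging. */
    static Path newTempStagingDir() {
        try {
            return Files.createTempDirectory("staging");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path staging = newTempStagingDir();
        System.out.println("attempt 1 runs job: " + shouldRunJob(staging));
        record(staging, "SUCCEEDED"); // then report to the client, then crash
        System.out.println("attempt 2 runs job: " + shouldRunJob(staging));
    }
}
```

With such a marker in place, a re-attempt that finds a recorded state would
skip the job and its output setup, and would only need to finish the
interrupted cleanup and unregister with the RM using the recorded status.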