[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Joseph Evans updated MAPREDUCE-4819:
-------------------------------------------

    Attachment: MR-4819-bobby-trunk.txt

Bikas,

I would actually like to propose an alternative fix. I am attaching a very preliminary patch. This will instead put a "lock" around the job commit by adding a few new files into the staging directory. Task commits would be required to handle the rare possibility of a double commit, just as is possible in 1.0 now. We would make it just as likely to happen as it is in 1.0 by also putting in MAPREDUCE-4832, which would help to ensure that we don't have two AMs telling tasks to do things at the same time.

I would appreciate any feedback on this approach. I am going to be working to add in more tests and clean up the code.

> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch,
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before
> unregistering with the RM then the RM can run another AM attempt. Currently
> AM re-attempts assume that the previous attempts did not reach a final job
> state, and that causes the job to rerun (from scratch, if the output format
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the
> job is bad for a number of reasons. If the job failed, it's confusing at
> best since the client was already told the job failed but the subsequent
> attempt could succeed.
> If the job succeeded there could be data loss, as a
> subsequent job launched by the client tries to consume the job's output as
> input just as the re-attempt starts removing output files in preparation for
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
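
The commit "lock" proposed in the comment above is described only in outline, so the following is a rough illustration of the idea and not the contents of the attached patch. It guards job commit with marker files in the staging directory, so that a later AM attempt can tell whether an earlier attempt already started or finished the commit. The class name, the marker file names, and the use of java.nio on a local directory are all assumptions made for illustration; a real implementation would presumably go through the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Sketch of a job-commit "lock" built from marker files in a staging directory. */
final class CommitLock {
    /** What an AM attempt can infer from markers left by earlier attempts. */
    enum State { NOT_STARTED, STARTED_NOT_FINISHED, SUCCEEDED, FAILED }

    // Hypothetical marker names; the message does not say what the files are called.
    private final Path started;
    private final Path success;
    private final Path failure;

    CommitLock(Path stagingDir) {
        this.started = stagingDir.resolve("COMMIT_STARTED");
        this.success = stagingDir.resolve("COMMIT_SUCCESS");
        this.failure = stagingDir.resolve("COMMIT_FAIL");
    }

    /** Reads the markers; end markers take precedence over the start marker. */
    State inspect() {
        if (Files.exists(success)) return State.SUCCEEDED;
        if (Files.exists(failure)) return State.FAILED;
        if (Files.exists(started)) return State.STARTED_NOT_FINISHED;
        return State.NOT_STARTED;
    }

    /** Tries to take the lock; returns false if some attempt already began committing. */
    boolean tryBegin() throws IOException {
        try {
            // createFile checks for existence and creates the file in one atomic step.
            Files.createFile(started);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    /** Records the final outcome so that a re-attempt does not rerun the job. */
    void finish(boolean committed) throws IOException {
        Files.createFile(committed ? success : failure);
    }
}
```

Under these assumptions, a new AM attempt would call inspect() before doing any work: SUCCEEDED or FAILED means the final status may already have been reported and the job should not be rerun, while STARTED_NOT_FINISHED means an earlier attempt died mid-commit and the output is in an unknown state. Only an attempt for which tryBegin() returns true would go on to commit the job output.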