[ https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15000291#comment-15000291 ]
Bikas Saha commented on MAPREDUCE-5485:
---------------------------------------

bq. The cleanupInterruptedCommit() already check previous job commit succeed or failed. Am I missing anything here?

This duplicates the code that checks commit status, which can cause a bug if the logic changes in only one of the two places. It also makes extra RPC calls to HDFS to check file status, which is avoidable. Moving the code to the place where we previously failed because of an in-progress commit will let this method do exactly what its name suggests: clean up in-progress commit markers. Does that clarify?

Should we say "previous AM failures" to be precise?
{code}
+ * If repeatable job commit is supported, job restart can tolerate previous
+ * failures during job commit.
{code}

To be clear, we should look at adding 2 more tests.
1) Test the new MR AppMaster functionality that allows commit to proceed in a retried AM if commit is repeatable.
2) Test in FileOutputCommitter that, for repeatable commit, a FileNotFoundException is not counted as an error (new behavior).
Maybe the patch is missing a new or changed file? Sorry if I missed something and the tests already exist.

> Allow repeating job commit by extending OutputCommitter API
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.1.0-beta
>            Reporter: Nemon Lou
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch,
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch,
> MAPREDUCE-5485-v3.patch, MAPREDUCE-5485-v4.1.patch, MAPREDUCE-5485-v4.patch
>
>
> The MRAppMaster may crash during job commit, or a NodeManager restart may
> cause the committing AM to exit because its container expires. In these
> cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.
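
For illustration, the restart behavior discussed in the comment above (check the previous commit status in one place, and only clean up markers and redo the commit when the committer says commit is repeatable) could be modeled roughly as follows. This is only a sketch under assumptions: the class, enum, and method names are hypothetical and are not taken from the patch or from MRAppMaster, and it assumes the previous AM attempt leaves start/success/failure markers behind. It covers the AM-side decision only, not the FileOutputCommitter change.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical, simplified model of how a retried AM might treat a previous commit attempt. */
final class CommitRecoverySketch {

    /** Markers a previous AM attempt may have left behind (names are illustrative). */
    enum Marker { COMMIT_STARTED, COMMIT_SUCCESS, COMMIT_FAILED }

    /** What the retried AM does next. */
    enum Action { REPORT_SUCCESS, REPORT_FAILURE, CLEAN_MARKERS_AND_RECOMMIT, RUN_NORMALLY }

    /**
     * Single place where the previous commit status is inspected, so the
     * cleanup step does not need to re-check it.
     */
    static Action decide(Set<Marker> markers, boolean commitRepeatable) {
        if (markers.contains(Marker.COMMIT_SUCCESS)) {
            return Action.REPORT_SUCCESS;   // previous attempt finished the commit
        }
        if (markers.contains(Marker.COMMIT_FAILED)) {
            return Action.REPORT_FAILURE;   // previous attempt recorded a failed commit
        }
        if (markers.contains(Marker.COMMIT_STARTED)) {
            // Commit was in progress when the previous AM went away.
            // Without repeatable commit this is a job failure; with it,
            // remove the in-progress markers and run the commit again.
            return commitRepeatable ? Action.CLEAN_MARKERS_AND_RECOMMIT
                                    : Action.REPORT_FAILURE;
        }
        return Action.RUN_NORMALLY;         // no commit was ever started
    }

    public static void main(String[] args) {
        Set<Marker> interrupted = EnumSet.of(Marker.COMMIT_STARTED);
        System.out.println(decide(interrupted, true));
        System.out.println(decide(interrupted, false));
    }
}
```

A test for item 1) would exercise the interrupted case with the repeatable flag both on and off, as the `main` method does here.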