[
https://issues.apache.org/jira/browse/MAPREDUCE-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746655#comment-13746655
]
Jian He commented on MAPREDUCE-5471:
------------------------------------
The current RMStateStore implementation doesn't store the any information about
the completion status of the attempts, it just removes the app info when app
completes. When RM comes back, it scans the store and deems every app remain in
the store as not completed and resubmit the job. Agree that we need some kind
of persistent flag to indicate the previous attempt completes or not. As a temp
fix to reduce the race as mentioned in YARN-540, clean the app info immediately
after the unregister call.
> Succeed job tries to restart after RMrestart
> --------------------------------------------
>
> Key: MAPREDUCE-5471
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5471
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: yeshavora
> Assignee: Jian He
> Priority: Blocker
>
> Run a job , restart RM when job just finished. It should not restart the job
> once it Succeed.
> After RM restart, The AM of restarted job fails with below error.
> AM log after Rmrestart:
> 013-08-19 17:29:21,144 INFO [main]
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping
> JobHistoryEventHandler. Size of the outstanding queue size is 0
> 2013-08-19 17:29:21,145 INFO [main]
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped
> JobHistoryEventHandler. super.stop()
> 2013-08-19 17:29:21,146 INFO [main]
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Deleting staging directory
> hdfs://host1:port1/user/ABC/.staging/job_1376933101704_0001
> 2013-08-19 17:29:21,156 FATAL [main]
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
> java.io.FileNotFoundException: File does not exist:
> hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
> at
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1469)
> at
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1324)
> at
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1291)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:922)
> at
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:131)
> at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1184)
> at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:995)
> at
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1394)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
> at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1390)
> at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1323)
> Caused by: java.io.FileNotFoundException: File does not exist:
> hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1121)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1113)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:78)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1113)
> at
> org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:51)
> at
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1464)
> ... 17 more
> 2013-08-19 17:29:21,158 INFO [Thread-2]
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a
> signal. Signaling RMCommunicator and JobHistoryEventHandler.
> 2013-08-19 17:29:21,159 WARN [Thread-2]
> org.apache.hadoop.util.ShutdownHookManager: ShutdownHook
> 'MRAppMasterShutdownHook' failed, java.lang.NullPointerException
> java.lang.NullPointerException
> at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.setSignalled(MRAppMaster.java:805)
> at
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1344)
> at
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira