[jira] [Commented] (MAPREDUCE-5471) Succeed job tries to restart after RMrestart

Jason Lowe (JIRA) Wed, 21 Aug 2013 07:58:23 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13746094#comment-13746094
 ]


Jason Lowe commented on MAPREDUCE-5471:
---------------------------------------

This sounds like it could be a case where the first AM attempt succeeded, 
cleaned the staging directory, then the RM restarted before the AM could 
unregister.  Would need to examine the first AM attempt's logs to know for 
sure.  If that is indeed the case, then this is the same issue as described in 
YARN-917, and the fix would be to move the staging dir cleanup service to 
execute after the AM unregisters with the RM.
                
> Succeed job tries to restart after RMrestart
> --------------------------------------------
>
>                 Key: MAPREDUCE-5471
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5471
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: yeshavora
>            Assignee: Jian He
>            Priority: Blocker
>
> Run a job , restart RM when job just finished. It should not restart the job 
> once it Succeed.
> After RM restart, The AM of restarted job fails with below error.
> AM log after Rmrestart:
> 013-08-19 17:29:21,144 INFO [main] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping 
> JobHistoryEventHandler. Size of the outstanding queue size is 0
> 2013-08-19 17:29:21,145 INFO [main] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped 
> JobHistoryEventHandler. super.stop()
> 2013-08-19 17:29:21,146 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Deleting staging directory 
> hdfs://host1:port1/user/ABC/.staging/job_1376933101704_0001
> 2013-08-19 17:29:21,156 FATAL [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.io.FileNotFoundException: File does not exist: 
> hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1469)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1324)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1291)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:922)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:131)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1184)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:995)
>         at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1394)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1390)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1323)
> Caused by: java.io.FileNotFoundException: File does not exist: 
> hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1121)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1113)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:78)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1113)
>         at 
> org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:51)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1464)
>         ... 17 more
> 2013-08-19 17:29:21,158 INFO [Thread-2] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a 
> signal. Signaling RMCommunicator and JobHistoryEventHandler.
> 2013-08-19 17:29:21,159 WARN [Thread-2] 
> org.apache.hadoop.util.ShutdownHookManager: ShutdownHook 
> 'MRAppMasterShutdownHook' failed, java.lang.NullPointerException
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.setSignalled(MRAppMaster.java:805)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1344)
>         at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (MAPREDUCE-5471) Succeed job tries to restart after RMrestart

Reply via email to