[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757316#comment-13757316
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-5471:
----------------------------------------------------

Synced up offline with [~jianhe] and [~bikassaha]. It seems the correct 
solution for this and YARN-540 is work-preserving restart. That said, we can 
have a short-term 'fix' for MapReduce. If the MR AM scans the staging directory, 
followed by the intermediate directory and the done directory, before starting 
the new AM's jobs, we should be good. The tricky part is locating the file in 
the done directory. Let's see if we can do that.
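
A rough sketch of that scan order, not the actual patch. It uses local 
java.nio paths as a stand-in for HDFS. The class name, the method name, and 
the ".jhist" suffix check are all assumptions made for illustration, as is the 
idea that the done directory nests its files in subdirectories.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical sketch only: local java.nio paths stand in for HDFS, and the
// class, method, and ".jhist" suffix check are assumptions, not the MR AM code.
class FinishedJobScan {

    // Scans the given directories in order and returns the first regular file
    // whose name starts with jobId and ends with ".jhist", if any.
    static Optional<Path> locateHistoryFile(String jobId, List<Path> scanOrder)
            throws IOException {
        for (Path dir : scanOrder) {
            if (!Files.isDirectory(dir)) {
                continue; // e.g. the staging directory was already deleted
            }
            // Assumption: the done directory may nest files in subdirectories,
            // so walk the whole tree rather than listing one level.
            try (Stream<Path> files = Files.walk(dir)) {
                Optional<Path> hit = files
                        .filter(Files::isRegularFile)
                        .filter(p -> {
                            String name = p.getFileName().toString();
                            return name.startsWith(jobId) && name.endsWith(".jhist");
                        })
                        .findFirst();
                if (hit.isPresent()) {
                    return hit;
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("mr5471-sketch");
        Path staging = root.resolve("staging"); // deliberately not created
        Path intermediate = Files.createDirectories(root.resolve("intermediate"));
        Path done = Files.createDirectories(root.resolve("done"));
        Path nested = Files.createDirectories(done.resolve("2013").resolve("08"));
        Files.createFile(nested.resolve("job_1376933101704_0001-example.jhist"));

        Optional<Path> hit = locateHistoryFile(
                "job_1376933101704_0001", Arrays.asList(staging, intermediate, done));
        System.out.println(hit.isPresent()
                ? "Found history file, skip rerun: " + hit.get()
                : "No history file found, start the job");
    }
}
```

Only the ordering and the recursive walk are the point here. Whether a file 
found in the staging directory is enough evidence that the job finished is 
left open and would need a stricter check in a real fix.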
                
> Succeed job tries to restart after RMrestart
> --------------------------------------------
>
>                 Key: MAPREDUCE-5471
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5471
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: yeshavora
>            Assignee: Jian He
>            Priority: Blocker
>         Attachments: MR5471-1AM.log, MR5471-2AM.log
>
>
> Run a job, then restart the RM just as the job finishes. The job should not be 
> restarted once it has succeeded.
> After the RM restart, the AM of the restarted job fails with the error below.
> AM log after RM restart:
> 2013-08-19 17:29:21,144 INFO [main] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopping 
> JobHistoryEventHandler. Size of the outstanding queue size is 0
> 2013-08-19 17:29:21,145 INFO [main] 
> org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler: Stopped 
> JobHistoryEventHandler. super.stop()
> 2013-08-19 17:29:21,146 INFO [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Deleting staging directory 
> hdfs://host1:port1/user/ABC/.staging/job_1376933101704_0001
> 2013-08-19 17:29:21,156 FATAL [main] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: Error starting MRAppMaster
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.io.FileNotFoundException: File does not exist: 
> hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1469)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1324)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.transition(JobImpl.java:1291)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
>         at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:922)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl.handle(JobImpl.java:131)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$JobEventDispatcher.handle(MRAppMaster.java:1184)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceStart(MRAppMaster.java:995)
>         at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$1.run(MRAppMaster.java:1394)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1477)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1390)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1323)
> Caused by: java.io.FileNotFoundException: File does not exist: 
> hdfs://host1:port1/ABC/.staging/job_1376933101704_0001/job.splitmetainfo
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1121)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1113)
>         at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:78)
>         at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1113)
>         at 
> org.apache.hadoop.mapreduce.split.SplitMetaInfoReader.readSplitMetaInfo(SplitMetaInfoReader.java:51)
>         at 
> org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl$InitTransition.createSplits(JobImpl.java:1464)
>         ... 17 more
> 2013-08-19 17:29:21,158 INFO [Thread-2] 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster: MRAppMaster received a 
> signal. Signaling RMCommunicator and JobHistoryEventHandler.
> 2013-08-19 17:29:21,159 WARN [Thread-2] 
> org.apache.hadoop.util.ShutdownHookManager: ShutdownHook 
> 'MRAppMasterShutdownHook' failed, java.lang.NullPointerException
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$ContainerAllocatorRouter.setSignalled(MRAppMaster.java:805)
>         at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$MRAppMasterShutdownHook.run(MRAppMaster.java:1344)
>         at 
> org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

