prophy Yan created HADOOP-9766: ---------------------------------- Summary: job can not recovery after restart resourcemanager Key: HADOOP-9766 URL: https://issues.apache.org/jira/browse/HADOOP-9766 Project: Hadoop Common Issue Type: Bug Components: test Affects Versions: 2.0.5-alpha Environment: CentOS5.3 JDK1.7.0_11 Reporter: prophy Yan Priority: Critical
Recently, i have test the function job recovery in the YARN framework, but it failed. first, i run the wordcount example program, and the i kill -9 the resourcemanager process on the server when the wordcount process in map 100%. the job will exit with error in minutes. second, i restart the resourcemanager on the server by user the 'start-yarn.sh' command. but, the failed job(wordcount) can not to continue. Here is the YARN log: 013-07-23 16:05:21,472 INFO org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done launching container Container: [ContainerId: container_1374564764970_0001_02_000001, NodeId: mv8.mzhen.cn:52117, NodeHttpAddress: mv8.mzhen.cn:8042, Resource: <memory:2048, vCores:1>, Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_000002 2013-07-23 16:05:21,473 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_000002 State change from ALLOCATED to LAUNCHED 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_000002 State change from LAUNCHED to FAILED 2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application application_1374564764970_0001 failed 1 times due to AM Container for appattempt_1374564764970_0001_000002 exited with exitCode: -1000 due to: RemoteTrace: java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815) at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176) at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284) at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280) at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) at LocalTrace: org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217) at org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:491) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:218) at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46) at org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735) .Failing this attempt.. Failing the application. 2013-07-23 16:05:21,935 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1374564764970_0001 State change from ACCEPTED to FAILED 2013-07-23 16:05:21,937 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=supertool OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1374564764970_0001 failed 1 times due to AM Container for appattempt_1374564764970_0001_000002 exited with exitCode: -1000 due to: RemoteTrace: java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens this is the log in YARN-logfile after i restart the resourcemanager -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira