[
https://issues.apache.org/jira/browse/HADOOP-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
prophy Yan updated HADOOP-9766:
-------------------------------
Description:
Recently, I tested the job recovery function in the YARN framework, but it
failed.
First, I ran the wordcount example program and then ran kill -9 on the
resourcemanager process on the server when the wordcount job was at map 100%.
The job exited with an error within minutes.
Second, I restarted the resourcemanager on the server with the 'start-yarn.sh'
command, but the failed job (wordcount) could not continue.
The YARN log reports "File does not exist".
Here is the YARN log:
2013-07-23 16:05:21,472 INFO
org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
launching container Container: [ContainerId:
container_1374564764970_0001_02_000001, NodeId: mv8.mzhen.cn:52117,
NodeHttpAddress: mv8.mzhen.cn:8042, Resource: <memory:2048, vCores:1>,
Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id {,
application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId: 2, },
id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_000002
2013-07-23 16:05:21,473 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1374564764970_0001_000002 State change from ALLOCATED to LAUNCHED
2013-07-23 16:05:21,925 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1374564764970_0001_000002 State change from LAUNCHED to FAILED
2013-07-23 16:05:21,925 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application
application_1374564764970_0001 failed 1 times due to AM Container for
appattempt_1374564764970_0001_000002 exited with exitCode: -1000 due to:
RemoteTrace:
java.io.FileNotFoundException: File does not exist:
hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
at LocalTrace:
org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl:
File does not exist:
hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
at
org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
at
org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:491)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:218)
at
org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
at
org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
.Failing this attempt.. Failing the application.
2013-07-23 16:05:21,935 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1374564764970_0001 State change from ACCEPTED to FAILED
2013-07-23 16:05:21,937 WARN
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=supertool
OPERATION=Application Finished - Failed TARGET=RMAppManager
RESULT=FAILURE DESCRIPTION=App failed with state: FAILED
PERMISSIONS=Application application_1374564764970_0001 failed 1 times due to AM
Container for appattempt_1374564764970_0001_000002 exited with exitCode: -1000
due to: RemoteTrace:
java.io.FileNotFoundException: File does not exist:
hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
This is the log in the YARN log file after I restarted the resourcemanager.
> job can not recovery after restart resourcemanager
> --------------------------------------------------
>
> Key: HADOOP-9766
> URL: https://issues.apache.org/jira/browse/HADOOP-9766
> Project: Hadoop Common
> Issue Type: Bug
> Components: test
> Affects Versions: 2.0.5-alpha
> Environment: CentOS5.3 JDK1.7.0_11
> Reporter: prophy Yan
> Priority: Critical
>
> Recently, I tested the job recovery function in the YARN framework, but it
> failed.
> First, I ran the wordcount example program and then ran kill -9 on the
> resourcemanager process on the server when the wordcount job was at map 100%.
> The job exited with an error within minutes.
> Second, I restarted the resourcemanager on the server with the
> 'start-yarn.sh' command, but the failed job (wordcount) could not continue.
> The YARN log reports "File does not exist".
> Here is the YARN log:
> 2013-07-23 16:05:21,472 INFO
> org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher: Done
> launching container Container: [ContainerId:
> container_1374564764970_0001_02_000001, NodeId: mv8.mzhen.cn:52117,
> NodeHttpAddress: mv8.mzhen.cn:8042, Resource: <memory:2048, vCores:1>,
> Priority: 0, State: NEW, Token: null, Status: container_id {, app_attempt_id
> {, application_id {, id: 1, cluster_timestamp: 1374564764970, }, attemptId:
> 2, }, id: 1, }, state: C_NEW, ] for AM appattempt_1374564764970_0001_000002
> 2013-07-23 16:05:21,473 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1374564764970_0001_000002 State change from ALLOCATED to LAUNCHED
> 2013-07-23 16:05:21,925 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> appattempt_1374564764970_0001_000002 State change from LAUNCHED to FAILED
> 2013-07-23 16:05:21,925 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Application
> application_1374564764970_0001 failed 1 times due to AM Container for
> appattempt_1374564764970_0001_000002 exited with exitCode: -1000 due to:
> RemoteTrace:
> java.io.FileNotFoundException: File does not exist:
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:815)
> at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:176)
> at
> org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:51)
> at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:284)
> at org.apache.hadoop.yarn.util.FSDownload$1.run(FSDownload.java:282)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:280)
> at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:51)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> at LocalTrace:
> org.apache.hadoop.yarn.exceptions.impl.pb.YarnRemoteExceptionPBImpl:
> File does not exist:
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> at
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.convertFromProtoFormat(LocalResourceStatusPBImpl.java:217)
> at
> org.apache.hadoop.yarn.server.nodemanager.api.protocolrecords.impl.pb.LocalResourceStatusPBImpl.getException(LocalResourceStatusPBImpl.java:147)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.update(ResourceLocalizationService.java:819)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker.processHeartbeat(ResourceLocalizationService.java:491)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService.heartbeat(ResourceLocalizationService.java:218)
> at
> org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.service.LocalizationProtocolPBServiceImpl.heartbeat(LocalizationProtocolPBServiceImpl.java:46)
> at
> org.apache.hadoop.yarn.proto.LocalizationProtocol$LocalizationProtocolService$2.callBlockingMethod(LocalizationProtocol.java:57)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:454)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1014)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1741)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1737)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1478)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1735)
> .Failing this attempt.. Failing the application.
> 2013-07-23 16:05:21,935 INFO
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
> application_1374564764970_0001 State change from ACCEPTED to FAILED
> 2013-07-23 16:05:21,937 WARN
> org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=supertool
> OPERATION=Application Finished - Failed TARGET=RMAppManager
> RESULT=FAILURE DESCRIPTION=App failed with state: FAILED
> PERMISSIONS=Application application_1374564764970_0001 failed 1 times due to
> AM Container for appattempt_1374564764970_0001_000002 exited with exitCode:
> -1000 due to: RemoteTrace:
> java.io.FileNotFoundException: File does not exist:
> hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
> This is the log in the YARN log file after I restarted the resourcemanager.
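
To make the failure sequence in the log above easier to read, here is a minimal Python sketch that pulls out the state transitions and the missing HDFS path. It is only an illustration, not part of Hadoop: the `summarize` helper and both regular expressions are hypothetical names introduced here, and the excerpt assumes the email-wrapped log lines have been rejoined into one line per entry.

```python
import re

# Excerpt of the ResourceManager log from the report, with wrapped lines
# rejoined by hand (an assumption; the email wraps them at about 80 columns).
LOG = """\
2013-07-23 16:05:21,473 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_000002 State change from ALLOCATED to LAUNCHED
2013-07-23 16:05:21,925 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1374564764970_0001_000002 State change from LAUNCHED to FAILED
java.io.FileNotFoundException: File does not exist: hdfs://ns1:8020/tmp/hadoop-yarn/staging/supertool/.staging/job_1374564764970_0001/appTokens
2013-07-23 16:05:21,935 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: application_1374564764970_0001 State change from ACCEPTED to FAILED
"""

# Hypothetical patterns written for this excerpt only.
STATE_CHANGE = re.compile(r"(\S+) State change from (\w+) to (\w+)")
MISSING_FILE = re.compile(r"FileNotFoundException: File does not exist:\s*(\S+)")


def summarize(log_text):
    """Return (state transitions, missing file paths) found in the log text."""
    transitions = STATE_CHANGE.findall(log_text)
    missing = sorted(set(MISSING_FILE.findall(log_text)))
    return transitions, missing


transitions, missing = summarize(LOG)
for entity, old, new in transitions:
    print(f"{entity}: {old} -> {new}")
for path in missing:
    print(f"missing: {path}")
```

Read this way, the second application attempt goes from ALLOCATED to LAUNCHED to FAILED in under half a second, and the only missing resource named in the trace is the appTokens file in the job's staging directory.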