[
https://issues.apache.org/jira/browse/FLINK-4142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379079#comment-15379079
]
Robert Metzger commented on FLINK-4142:
---------------------------------------
Thank you for posting a log as well.
It seems to be a YARN-specific issue:
{code}
2016-07-01 15:45:03,452 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  - Launching TaskManager in container ContainerInLaunch @ 1467387903451: Container: [ContainerId: container_1467387670862_0001_02_000002, NodeId: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436, NodeHttpAddress: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.0.18:40436 }, ] on host hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal
2016-07-01 15:45:03,455 INFO  org.apache.hadoop.yarn.client.api.impl.ContainerManagementProtocolProxy  - Opening proxy : hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436
2016-07-01 15:45:03,508 ERROR org.apache.flink.yarn.YarnFlinkResourceManager  - Could not start TaskManager in container ContainerInLaunch @ 1467387903451: Container: [ContainerId: container_1467387670862_0001_02_000002, NodeId: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:40436, NodeHttpAddress: hadoop-srichter-worker-3-vm.c.astral-sorter-757.internal:8042, Resource: <memory:4096, vCores:1>, Priority: 0, Token: Token { kind: ContainerToken, service: 10.240.0.18:40436 }, ]
org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
NMToken for application attempt : appattempt_1467387670862_0001_000001 was used for starting container with container token issued for application attempt : appattempt_1467387670862_0001_000002
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
	at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
	at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:206)
	at org.apache.flink.yarn.YarnFlinkResourceManager.containersAllocated(YarnFlinkResourceManager.java:403)
	at org.apache.flink.yarn.YarnFlinkResourceManager.handleMessage(YarnFlinkResourceManager.java:164)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:90)
	at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:70)
	at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:167)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:97)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254)
	at akka.dispatch.Mailbox.run(Mailbox.scala:221)
	at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-07-01 15:45:03,508 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  - Requesting new TaskManager container with 4096 megabytes memory. Pending requests: 1
2016-07-01 15:45:03,959 INFO  org.apache.flink.yarn.YarnFlinkResourceManager  - Container ResourceID{resourceId='container_1467387670862_0001_02_000002'} completed successfully with diagnostics: Container released by application
{code}
The exception shows the root cause: after the ApplicationMaster restart, the NMToken issued for the first application attempt (appattempt_1467387670862_0001_000001) is used to start a container whose container token was issued for the second attempt (appattempt_1467387670862_0001_000002), so the NodeManager rejects the launch request. This is a major bug in Hadoop 2.4.x that was fixed in Hadoop 2.5.0:
https://issues.apache.org/jira/browse/YARN-2065
I'll add a warning to the YARN documentation page that high availability has known issues on YARN < 2.5.0.
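Until such a warning exists, affected users can at least detect the problematic version at runtime. As a rough illustration only (not the actual Flink or documentation change), a minimal sketch of such a check using Hadoop's {{org.apache.hadoop.util.VersionInfo}} utility could look like this:
{code}
// Illustrative sketch only, not part of Flink: warn when running against a
// Hadoop/YARN version below 2.5.0, where HA recovery is affected by YARN-2065.
import org.apache.hadoop.util.VersionInfo;

public class YarnHaVersionCheck {

	public static void main(String[] args) {
		String version = VersionInfo.getVersion(); // e.g. "2.4.1"
		String[] parts = version.split("\\.");
		int major = Integer.parseInt(parts[0]);
		int minor = Integer.parseInt(parts[1]);

		if (major < 2 || (major == 2 && minor < 5)) {
			System.err.println("WARNING: JobManager high availability on YARN requires "
					+ "Hadoop 2.5.0 or later (see YARN-2065). Detected Hadoop version: " + version);
		}
	}
}
{code}
Version strings can carry vendor suffixes, so this is only a sketch; the real fix is simply to run the cluster on Hadoop 2.5.0 or later.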
> Recovery problem in HA on Hadoop Yarn 2.4.1
> -------------------------------------------
>
> Key: FLINK-4142
> URL: https://issues.apache.org/jira/browse/FLINK-4142
> Project: Flink
> Issue Type: Bug
> Components: YARN Client
> Affects Versions: 1.0.3
> Reporter: Stefan Richter
>
> On Hadoop Yarn 2.4.1, recovery in HA fails in the following scenario:
> 1) Kill application master, let it recover normally.
> 2) After that, kill a task manager.
> Now, Yarn tries to restart the killed task manager in an endless loop.