[
https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17007184#comment-17007184
]
ouyangwulin commented on FLINK-12382:
-------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: Could not retrieve
the execution result. (JobID: c71c27d2fdfdc7d75686e820acae742c)
at
org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:261)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483)
at
org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:66)
at
org.apache.flink.streaming.examples.windowing.TopSpeedWindowing.main(TopSpeedWindowing.java:96)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
at
org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:423)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
at
org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
at
org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: org.apache.flink.runtime.rest.util.RestClientException: [Internal
server error., <Exception on server side:
org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
not set: Ignoring message LocalFencedMessage(ac4f896e2897c5bc96eef1d93be5433d,
LocalRpcInvocation(requestJobStatus(JobID, Time))) sent to
akka.tcp://flink@flinkjobcluster-sample-ha-jobmanager:6123/user/dispatcher
because the fencing token is null.
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
at
akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
End of exception on server side>]
at org.apache.flink.runtime.rest.RestClient.parseResponse(RestClient.java:389)
at
org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$3(RestClient.java:373)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:966)
at
java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940)
at
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:456)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
> HA + ResourceManager exception: Fencing token not set
> -----------------------------------------------------
>
> Key: FLINK-12382
> URL: https://issues.apache.org/jira/browse/FLINK-12382
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.8.0
> Environment: Same as all previous bugs filed by myself today, but
> this time with HA with zetcd.
> Reporter: Henrik
> Priority: Major
>
> I'm testing zetcd + session jobs in k8s, and what happens when I kill both
> the job-cluster and the task-manager at the same time while keeping ZK/zetcd
> up and running.
> I then get the stack trace below, which is completely non-actionable for me
> and also resolves itself. I expect a number of retries, and if this exception
> is part of the protocol and only signals a retry, then it should not be
> printed as a log entry.
> This might be related to an older bug:
> [https://jira.apache.org/jira/browse/FLINK-7734]
> {code:java}
> [tm] 2019-05-01 11:32:01,641 ERROR
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration
> at ResourceManager failed due to an error
> [tm] java.util.concurrent.CompletionException:
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
> not set: Ignoring message
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e,
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int,
> HardwareDescription, Time))) sent to
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing
> token is null.
> [tm] at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> [tm] at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> [tm] at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> [tm] at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> [tm] at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [tm] at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> [tm] at
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
> [tm] at akka.dispatch.OnComplete.internal(Future.scala:258)
> [tm] at akka.dispatch.OnComplete.internal(Future.scala:256)
> [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> [tm] at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> [tm] at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> [tm] at
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> [tm] at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> [tm] at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> [tm] at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> [tm] at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
> [tm] at
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
> [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm] at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
> [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by:
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token
> not set: Ignoring message
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e,
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int,
> HardwareDescription, Time))) sent to
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing
> token is null.
> [tm] at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [tm] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm] at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [tm] at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm] ... 9 more
> [tm] 2019-05-01 11:32:01,650 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and
> re-attempting registration in 10000 ms
> [tm] 2019-05-01 11:32:03,070 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - The heartbeat
> of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm] 2019-05-01 11:32:03,070 INFO
> org.apache.flink.runtime.taskexecutor.TaskExecutor - Close
> JobManager connection for job 00000000000000000000000000000000.
> [tm] 2019-05-01 11:32:03,070 INFO org.apache.flink.runtime.taskmanager.Task
> - Attempting to fail task externally Source: Custom Source
> -> Process -> Timestamps/Watermarks -> app_events (1/1)
> (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task
> - Source: Custom Source -> Process ->
> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692)
> switched from RUNNING to FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for
> 00000000000000000000000000000000 lost the leadership.
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm] at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of
> JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm] ... 15 more
> [tm] 2019-05-01 11:32:03,071 INFO org.apache.flink.runtime.taskmanager.Task
> - Triggering cancellation of task code Source: Custom
> Source -> Process -> Timestamps/Watermarks -> app_events (1/1)
> (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task
> - Attempting to fail task externally user_sessions ->
> (Sink: sink_example_sessions, Filter, Filter) (1/1)
> (dbb8434fb24a04b8890520d4e59bbd25).
> [tm] 2019-05-01 11:32:03,085 INFO org.apache.flink.runtime.taskmanager.Task
> - user_sessions -> (Sink: sink_example_sessions, Filter,
> Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25) switched from RUNNING to
> FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for
> 00000000000000000000000000000000 lost the leadership.
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm] at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm] at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm] at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm] at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm] at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm] at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of
> JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm] at
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm] ... 15 more
> {code}
> tm stands for taskmanager in this deployment.
> EDIT: this also happens if you just temporarily disable network routing; it
> never recovers on its own despite having HA configured! In this case, it's
> the job manager that keeps crashing.
> {code:java}
> [job] 2019-05-01 13:03:32,299 ERROR
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Unhandled
> exception.
> [job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing
> token not set: Ignoring message
> LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917,
> LocalRpcInvocation(requestMultipleJobDetails(Time))) sent to
> akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token
> is null.
> [job] at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [job] at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [job] at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [job] at
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [job] at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [job] at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [job] at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [job] at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [job] at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [job] at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [job] at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [job] at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [job] at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [job] at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [job] at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [job] 2019-05-01 13:03:33,308 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler - Unhandled
> exception.{code}
> Killing both the TM and the JM/RM at that point doesn't fix it. I let it
> linger for 5 minutes in the broken state, and then:
> {code:java}
> //snip
> Fatal error occurred while executing the TaskManager. Shutting it down.
> // snip
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
> Could not register at the ResourceManager within the specified maximum
> registration duration 300000 ms. This indicates a problem with this instance.
> Terminating now.
> {code}
>
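For illustration, here is a minimal sketch of the behaviour the description asks for: a failure that only means "not ready yet" is reported in one line and retried after a pause, and only a give-up after the maximum duration surfaces as an error. This is not Flink's code. `TransientNotReadyException`, `RetrySketch` and `retryUntil` are hypothetical names, and the sketch assumes the caller can tell transient failures apart from real ones. The pause and maximum duration correspond to the 10000 ms and 300000 ms values in the logs above, scaled down in the demo.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical marker for a transient "leader not ready yet" failure; not a Flink class. */
class TransientNotReadyException extends Exception {
    TransientNotReadyException(String message) {
        super(message);
    }
}

final class RetrySketch {
    /**
     * Calls op until it succeeds. A transient failure is reported in one line
     * (no stack trace) and retried after pause. Once another pause would no
     * longer fit within maxDuration, the last failure is rethrown, wrapped.
     */
    static <T> T retryUntil(Callable<T> op, Duration pause, Duration maxDuration) throws Exception {
        final long start = System.nanoTime();
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                return op.call();
            } catch (TransientNotReadyException e) {
                long elapsed = System.nanoTime() - start;
                if (elapsed + pause.toNanos() >= maxDuration.toNanos()) {
                    // Only now does the failure become an error worth a stack trace.
                    throw new IllegalStateException(
                            "Gave up after " + attempt + " attempts within "
                                    + maxDuration.toMillis() + " ms", e);
                }
                // Expected while no leader is elected: one short line, then retry.
                System.out.println("Attempt " + attempt + " not ready (" + e.getMessage()
                        + "); retrying in " + pause.toMillis() + " ms");
                Thread.sleep(pause.toMillis());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated registration: not ready on the first two attempts.
        String result = retryUntil(() -> {
            if (++calls[0] < 3) {
                throw new TransientNotReadyException("fencing token is null");
            }
            return "registered";
        }, Duration.ofMillis(10), Duration.ofMillis(300));
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Any exception other than the transient marker propagates immediately, so real failures are not hidden by the retry loop.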