[ 
https://issues.apache.org/jira/browse/FLINK-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150106#comment-17150106
 ] 

Till Rohrmann commented on FLINK-12382:
---------------------------------------

Hi [~gael], the attached log files don't seem to be complete which makes it 
impossible to understand what was going on. In order to further debug the 
problem the complete logs and also the logs of the {{TaskManagers}} would be 
helpful. 

Moreover, it would be helpful to exactly know the environment in which you are 
running Flink. Recently, there has been a related issue which you can find here 
FLINK-18367.

> HA + ResourceManager exception: Fencing token not set
> -----------------------------------------------------
>
>                 Key: FLINK-12382
>                 URL: https://issues.apache.org/jira/browse/FLINK-12382
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.8.0
>         Environment: Same all all previous bugs filed by myself, today, but 
> this time with HA with zetcd.
>            Reporter: Henrik
>            Priority: Major
>         Attachments: jobmanager_log, jobmanager_stdout
>
>
> I'm testing zetcd + session jobs in k8s, and testing what happens when I kill 
> both the job-cluster and task-manager at the same time, but maintain ZK/zetcd 
> up and running.
> Then I get this stacktrace, that's completely non-actionable for me, and also 
> resolves itself. I expect a number of retries, and if this exception is part 
> of the protocol signalling to retry, then it should not be printed as a log 
> entry.
> This might be related to an older bug: 
> [https://jira.apache.org/jira/browse/FLINK-7734]
> {code:java}
> [tm] 2019-05-01 11:32:01,641 ERROR 
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration 
> at ResourceManager failed due to an error
> [tm] java.util.concurrent.CompletionException: 
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
> not set: Ignoring message 
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
> HardwareDescription, Time))) sent to 
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
> token is null.
> [tm]     at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> [tm]     at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> [tm]     at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> [tm]     at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> [tm]     at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> [tm]     at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> [tm]     at 
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:815)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:258)
> [tm]     at akka.dispatch.OnComplete.internal(Future.scala:256)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> [tm]     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> [tm]     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> [tm]     at 
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> [tm]     at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> [tm]     at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> [tm]     at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:534)
> [tm]     at akka.remote.DefaultMessageDispatcher.dispatch(Endpoint.scala:97)
> [tm]     at 
> akka.remote.EndpointReader$$anonfun$receive$2.applyOrElse(Endpoint.scala:982)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:446)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: 
> org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token 
> not set: Ignoring message 
> RemoteFencedMessage(b8fb92460bdd5c250de1415e6cf04d4e, 
> RemoteRpcInvocation(registerTaskExecutor(String, ResourceID, int, 
> HardwareDescription, Time))) sent to 
> akka.tcp://flink@analytics-job:6123/user/resourcemanager because the fencing 
> token is null.
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [tm]     at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     ... 9 more
> [tm] 2019-05-01 11:32:01,650 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Pausing and 
> re-attempting registration in 10000 ms
> [tm] 2019-05-01 11:32:03,070 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - The heartbeat 
> of JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm] 2019-05-01 11:32:03,070 INFO  
> org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close 
> JobManager connection for job 00000000000000000000000000000000.
> [tm] 2019-05-01 11:32:03,070 INFO  org.apache.flink.runtime.taskmanager.Task  
>                    - Attempting to fail task externally Source: Custom Source 
> -> Process -> Timestamps/Watermarks -> app_events (1/1) 
> (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task  
>                    - Source: Custom Source -> Process -> 
> Timestamps/Watermarks -> app_events (1/1) (a302013f150f292067cd498100dc6692) 
> switched from RUNNING to FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for 
> 00000000000000000000000000000000 lost the leadership.
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of 
> JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm]     ... 15 more
> [tm] 2019-05-01 11:32:03,071 INFO  org.apache.flink.runtime.taskmanager.Task  
>                    - Triggering cancellation of task code Source: Custom 
> Source -> Process -> Timestamps/Watermarks -> app_events (1/1) 
> (a302013f150f292067cd498100dc6692).
> [tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task  
>                    - Attempting to fail task externally user_sessions -> 
> (Sink: sink_example_sessions, Filter, Filter) (1/1) 
> (dbb8434fb24a04b8890520d4e59bbd25).
> [tm] 2019-05-01 11:32:03,085 INFO  org.apache.flink.runtime.taskmanager.Task  
>                    - user_sessions -> (Sink: sink_example_sessions, Filter, 
> Filter) (1/1) (dbb8434fb24a04b8890520d4e59bbd25) switched from RUNNING to 
> FAILED.
> [tm] org.apache.flink.util.FlinkException: JobManager responsible for 
> 00000000000000000000000000000000 lost the leadership.
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJobManagerConnection(TaskExecutor.java:1182)
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor.access$1200(TaskExecutor.java:136)
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1624)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:392)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:185)
> [tm]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [tm]     at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [tm]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [tm]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [tm]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [tm]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [tm]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [tm]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [tm]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [tm]     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [tm] Caused by: java.util.concurrent.TimeoutException: The heartbeat of 
> JobManager with id 3642aa576f132fecd6811ae0d314c2b5 timed out.
> [tm]     at 
> org.apache.flink.runtime.taskexecutor.TaskExecutor$JobManagerHeartbeatListener.lambda$notifyHeartbeatTimeout$0(TaskExecutor.java:1625)
> [tm]     ... 15 more
> {code}
> tm stands for taskmanager in this deployment.
> EDIT: this also happens if you just temporarily disable network routing; it 
> never recovers on its own despite having HA configured! In this case, it's 
> the job manager that keeps crashing.
> {code:java}
> [job] 2019-05-01 13:03:32,299 ERROR 
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler  - Unhandled 
> exception.
> [job] org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing 
> token not set: Ignoring message 
> LocalFencedMessage(aa2545e3e2ca903b1a0f331235954917, 
> LocalRpcInvocation(requestMultipleJobDetails(Time))) sent to 
> akka.tcp://flink@analytics-job:6123/user/dispatcher because the fencing token 
> is null.
> [job]     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:63)
> [job]     at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:147)
> [job]     at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
> [job]     at 
> akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
> [job]     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
> [job]     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
> [job]     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
> [job]     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
> [job]     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
> [job]     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
> [job]     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
> [job]     at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> [job]     at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> [job]     at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> [job]     at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> [job] 2019-05-01 13:03:33,308 ERROR 
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler   - Unhandled 
> exception.{code}
>  Then killing both the TM and JM/RM doesn't work. I let it linger for 5 
> minutes in the broken state, and then:
> {code:java}
> //snip
> Fatal error occurred while executing the TaskManager. Shutting it down.
> // snip
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>  Could not register at the ResourceManager within the specified maximum 
> registration duration 300000 ms. This indicates a problem with this instance. 
> Terminating now.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to