[ https://issues.apache.org/jira/browse/FLINK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414099#comment-16414099 ]
ASF GitHub Bot commented on FLINK-8887:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5767

[FLINK-8887] Wait for JobMaster leader election in Dispatcher

## What is the purpose of the change

Before sending requests from the Dispatcher to the JobMasters, the Dispatcher must wait until the respective JobMaster has gained leadership. Otherwise we might risk that the messages are ignored because no fencing token was set.

This is solved by letting the JobManagerRunner expose a CompletableFuture<JobMasterGateway> which is only completed after the JobMaster has gained leadership. The future is cleared once the leadership is revoked.

cc @GJL

## Brief change log

- Confirm leader session after `JobMaster` is started by `JobManagerRunner`
- Expose `JobMasterGateway` as a future in `JobManagerRunner`
- Wait for the `JobManagerRunner#getLeaderGatewayFuture` completion before sending messages from the `Dispatcher` to the `JobMaster`

## Verifying this change

- Added `DispatcherTest#testWaitingForJobMasterLeadership`

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (no)
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
  - The serializers: (no)
  - The runtime per-record code paths (performance sensitive): (no)
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
  - The S3 file system connector: (no)

## Documentation

  - Does this pull request introduce a new feature? (no)
  - If yes, how is the feature documented?
(not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixFencingToken

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5767.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5767

----
commit 2790952a1ec76962b3d5b905abc673a35945c63f
Author: Till Rohrmann <trohrmann@...>
Date:   2018-03-26T15:55:10Z

    [FLINK-8887] Wait for JobMaster leader election in Dispatcher

    Before sending requests from the Dispatcher to the JobMasters, the Dispatcher must wait
    until the respective JobMaster has gained leadership. Otherwise we might risk that the
    messages are ignored because no fencing token was set.

    This is solved by letting the JobManagerRunner expose a CompletableFuture<JobMasterGateway>
    which is only completed after the JobMaster has gained leadership. The future is cleared
    once the leadership is revoked.

----

> ClusterClient.getJobStatus can throw FencingTokenException
> ----------------------------------------------------------
>
>                 Key: FLINK-8887
>                 URL: https://issues.apache.org/jira/browse/FLINK-8887
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Gary Yao
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> *Description*
> Calling {{RestClusterClient.getJobStatus}} or {{MiniClusterClient.getJobStatus}} can result in a {{FencingTokenException}}.
> *Analysis*
> {{Dispatcher.requestJobStatus}} first looks the {{JobManagerRunner}} up by job id. If a reference is found, {{requestJobStatus}} is called on the respective instance. If not, the {{ArchivedExecutionGraphStore}} is queried.
> However, between the lookup and the method call, the {{JobMaster}} of the respective job may have lost leadership already (job finished), and has set the fencing token to {{null}}.
> *Stacktrace*
> {noformat}
> Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token mismatch: Ignoring message LocalFencedMessage(null, LocalRpcInvocation(requestJobStatus(Time))) because the fencing token null did not match the expected fencing token b8423c75bc6838244b8c93c8bd4a4f51.
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:73)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
>     at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}
> {noformat}
> Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException: Fencing token not set: Ignoring message LocalFencedMessage(null, LocalRpcInvocation(requestJobStatus(Time))) because the fencing token is null.
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleMessage(FencedAkkaRpcActor.java:56)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$onReceive$1(AkkaRpcActor.java:132)
>     at akka.actor.ActorCell$$anonfun$become$1.applyOrElse(ActorCell.scala:544)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
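
For readers following the thread, the approach described in the pull request (a gateway future that completes only once leadership is granted and is replaced when leadership is revoked) can be sketched roughly as below. This is a minimal illustration using only the JDK and is not Flink's actual code: `LeaderGatewaySketch`, `Runner`, and `Gateway` are hypothetical stand-ins for the `Dispatcher`, `JobManagerRunner`, and `JobMasterGateway`, and the job status is modelled as a plain string.

```java
import java.util.concurrent.CompletableFuture;

public class LeaderGatewaySketch {

    /** Hypothetical stand-in for JobMasterGateway; not Flink's interface. */
    public interface Gateway {
        String requestJobStatus();
    }

    /** Hypothetical stand-in for JobManagerRunner. */
    public static final class Runner {
        private final Object lock = new Object();
        // Stays incomplete until leadership has been granted.
        private CompletableFuture<Gateway> leaderGatewayFuture = new CompletableFuture<>();

        public CompletableFuture<Gateway> getLeaderGatewayFuture() {
            synchronized (lock) {
                return leaderGatewayFuture;
            }
        }

        /** Called once the leader session is confirmed: release all waiting callers. */
        public void grantLeadership(Gateway gateway) {
            synchronized (lock) {
                leaderGatewayFuture.complete(gateway);
            }
        }

        /** "Clears" the future: later callers wait again for the next leader. */
        public void revokeLeadership() {
            synchronized (lock) {
                leaderGatewayFuture = new CompletableFuture<>();
            }
        }
    }

    /** Dispatcher side: only talk to the gateway after the future has completed. */
    public static CompletableFuture<String> requestJobStatus(Runner runner) {
        return runner.getLeaderGatewayFuture().thenApply(Gateway::requestJobStatus);
    }

    public static void main(String[] args) {
        Runner runner = new Runner();

        CompletableFuture<String> status = requestJobStatus(runner);
        System.out.println("done before leadership: " + status.isDone());

        runner.grantLeadership(() -> "RUNNING");
        System.out.println("status after leadership: " + status.join());

        runner.revokeLeadership();
        System.out.println("done after revocation: " + requestJobStatus(runner).isDone());
    }
}
```

In this sketch a request issued before leadership is granted is simply deferred rather than sent to an endpoint whose fencing token is not yet set, which is the failure mode shown in the second stack trace. How the real `Dispatcher` handles timeouts or a job that has already finished is outside what the sketch covers.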