[ 
https://issues.apache.org/jira/browse/FLINK-15347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077348#comment-17077348
 ] 

Till Rohrmann commented on FLINK-15347:
---------------------------------------

The problem seems to be a race condition between the {{Dispatchers}} of two 
consecutive leader sessions. Before the second leader instance can be created, 
the former one needs to be unregistered from the underlying {{AkkaRpcService}} 
because both share the same endpoint name {{dispatcher}}. If the old leader is 
not completely unregistered, then one sees the following exception:

{code}
java.util.concurrent.CompletionException: org.apache.flink.util.FlinkRuntimeException: Could not create the Dispatcher rpc endpoint.
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:659)
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:632)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1595)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.util.FlinkRuntimeException: Could not create the Dispatcher rpc endpoint.
        at org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherGatewayServiceFactory.create(DefaultDispatcherGatewayServiceFactory.java:66)
        at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.createDispatcher(SessionDispatcherLeaderProcess.java:100)
        at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.lambda$createDispatcherIfRunning$0(SessionDispatcherLeaderProcess.java:95)
        at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.runIfState(AbstractDispatcherLeaderProcess.java:210)
        at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.runIfStateIs(AbstractDispatcherLeaderProcess.java:198)
        at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.createDispatcherIfRunning(SessionDispatcherLeaderProcess.java:95)
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:656)
        ... 10 more
Caused by: akka.actor.InvalidActorNameException: actor name [dispatcher] is not unique!
        at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(ChildrenContainer.scala:129)
        at akka.actor.dungeon.Children$class.reserveChild(Children.scala:135)
        at akka.actor.ActorCell.reserveChild(ActorCell.scala:429)
        at akka.actor.dungeon.Children$class.makeChild(Children.scala:275)
        at akka.actor.dungeon.Children$class.attachChild(Children.scala:49)
        at akka.actor.ActorCell.attachChild(ActorCell.scala:429)
        at akka.actor.ActorSystemImpl.actorOf(ActorSystem.scala:753)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcService.startServer(AkkaRpcService.java:219)
        at org.apache.flink.runtime.rpc.RpcEndpoint.<init>(RpcEndpoint.java:129)
        at org.apache.flink.runtime.rpc.FencedRpcEndpoint.<init>(FencedRpcEndpoint.java:48)
        at org.apache.flink.runtime.rpc.PermanentlyFencedRpcEndpoint.<init>(PermanentlyFencedRpcEndpoint.java:36)
        at org.apache.flink.runtime.dispatcher.Dispatcher.<init>(Dispatcher.java:137)
        at org.apache.flink.runtime.dispatcher.StandaloneDispatcher.<init>(StandaloneDispatcher.java:39)
        at org.apache.flink.runtime.dispatcher.SessionDispatcherFactory.createDispatcher(SessionDispatcherFactory.java:44)
        at org.apache.flink.runtime.dispatcher.SessionDispatcherFactory.createDispatcher(SessionDispatcherFactory.java:29)
        at org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherGatewayServiceFactory.create(DefaultDispatcherGatewayServiceFactory.java:60)
        ... 16 more

{code}
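
To illustrate the race described above, here is a minimal, self-contained sketch. It does not use Flink or Akka; {{NameRegistry}}, {{Endpoint}}, {{closeAsync}} and {{finishTermination}} are hypothetical stand-ins invented for this example. It only shows the general idea that creating the second endpoint should be chained on the first endpoint's termination, and it is not meant to describe the actual fix.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public class EndpointNameRace {

    /** Hypothetical stand-in for an RPC service that requires unique endpoint names. */
    static final class NameRegistry {
        private final Set<String> names = ConcurrentHashMap.newKeySet();

        void register(String name) {
            if (!names.add(name)) {
                throw new IllegalStateException("endpoint name [" + name + "] is not unique!");
            }
        }

        void unregister(String name) {
            names.remove(name);
        }
    }

    /** Hypothetical endpoint whose name is released only when termination finishes. */
    static final class Endpoint {
        private final NameRegistry registry;
        private final String name;
        private final CompletableFuture<Void> terminationFuture = new CompletableFuture<>();

        Endpoint(NameRegistry registry, String name) {
            registry.register(name); // throws if the name is still taken
            this.registry = registry;
            this.name = name;
        }

        /** Returns the future that completes once the endpoint is fully unregistered. */
        CompletableFuture<Void> closeAsync() {
            return terminationFuture;
        }

        /** Simulates the asynchronous shutdown completing at some later point. */
        void finishTermination() {
            registry.unregister(name);
            terminationFuture.complete(null);
        }
    }

    public static void main(String[] args) {
        NameRegistry registry = new NameRegistry();
        Endpoint oldLeader = new Endpoint(registry, "dispatcher");
        CompletableFuture<Void> oldTermination = oldLeader.closeAsync();

        // Racy variant: the old endpoint has not finished unregistering yet.
        try {
            new Endpoint(registry, "dispatcher");
        } catch (IllegalStateException e) {
            System.out.println("Racy creation failed: " + e.getMessage());
        }

        // Safer variant: create the new endpoint only after the old one terminated.
        CompletableFuture<Endpoint> newLeader =
                oldTermination.thenApply(ignored -> new Endpoint(registry, "dispatcher"));

        oldLeader.finishTermination();
        System.out.println("New endpoint created: "
                + (newLeader.isDone() && !newLeader.isCompletedExceptionally()));
    }
}
```

In this sketch the name is released before the termination future completes, so any creation chained on that future cannot observe the stale registration.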

> ZooKeeperDefaultDispatcherRunnerTest.testResourceCleanupUnderLeadershipChange 
> failed on Travis
> ----------------------------------------------------------------------------------------------
>
>                 Key: FLINK-15347
>                 URL: https://issues.apache.org/jira/browse/FLINK-15347
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.11.0
>
>
> The test 
> {{ZooKeeperDefaultDispatcherRunnerTest.testResourceCleanupUnderLeadershipChange}} 
> failed on Travis because it got stuck.
> https://api.travis-ci.org/v3/job/627661879/log.txt



--
This message was sent by Atlassian Jira
(v8.3.4#803005)