[
https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16635360#comment-16635360
]
Till Rohrmann edited comment on FLINK-10475 at 10/2/18 12:11 PM:
-----------------------------------------------------------------
Hi [~Jamalarm], it sounds as if ZooKeeper did not notice that the leading JM
had been killed. Thus, it could simply be a ZooKeeper setup problem.
To debug the problem further, it would be helpful to get the logs of the
JobManagers.
The error messages originate from the REST handlers and do not indicate a
critical problem.
> Standalone HA - Leader election is not triggered on loss of leader
> ------------------------------------------------------------------
>
> Key: FLINK-10475
> URL: https://issues.apache.org/jira/browse/FLINK-10475
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.5.4
> Reporter: Thomas Wozniakowski
> Priority: Blocker
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4. Happy to see that the issue of
> jobgraphs hanging around forever has been resolved in standalone/zookeeper HA
> mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a cluster of 3
> ZooKeeper nodes, 3 JobManagers, and 3 TaskManagers, then started my job; all
> was fine with the new version. I then killed the leading JobManager to test
> the failover. The remaining JobManagers never triggered a leader election
> and simply got stuck.
> The logs of the remaining job managers were full of this:
> {quote}
> 2018-10-01 15:35:44,558 ERROR org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler - Could not retrieve the redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://[email protected]:50010/user/dispatcher#-1286445443]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
> at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:770)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:745)
> {quote}
> Please give me a shout if I can provide any more useful information.