[ https://issues.apache.org/jira/browse/FLINK-15661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135670#comment-17135670 ]
Till Rohrmann commented on FLINK-15661:
---------------------------------------
The root problem appears to be an unstable ZooKeeper connection in the
test. As a result, the following happens:
1) Dispatcher gets leadership granted & announces its address
2) Test case obtains leader address & tries to connect to it
3) Due to a ZooKeeper timeout, the Dispatcher loses the leadership
4) Once the ZooKeeper connection is re-established, the Dispatcher regains
leadership right away & announces a new leader address
5) The test fails because it still tries to connect to the old leader address
I see two options: we could either harden the whole test to tolerate ZooKeeper
outages, since they can happen at any point in time, or remove this test if we
have an e2e test which covers the same logic.
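To make the first option more concrete, here is a minimal sketch of what
"tolerating" a leader change could look like on the test side: the leader
address is looked up again before every attempt instead of being cached once.
The class LeaderRetry, the method callCurrentLeader and its parameters are
hypothetical names for illustration only. They are not existing Flink
utilities, and the real test would need to plug in its own leader retrieval
and REST call.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper, not part of Flink: retries a request against the current leader. */
public final class LeaderRetry {

    private LeaderRetry() {}

    /**
     * Looks up the leader address before each attempt, so that an address
     * announced after a leadership change is picked up on the next try.
     *
     * @param leaderAddressLookup returns the most recently announced leader address
     * @param request the call to run against that address; signals failure by throwing
     * @param maxAttempts total number of attempts, must be at least 1
     * @param backoffMillis pause between two attempts
     */
    public static <T> T callCurrentLeader(
            Supplier<String> leaderAddressLookup,
            Function<String, T> request,
            int maxAttempts,
            long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Re-resolve on every attempt; a cached address may already be stale.
            String address = leaderAddressLookup.get();
            try {
                return request.apply(address);
            } catch (RuntimeException e) {
                lastFailure = e;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(backoffMillis);
                    } catch (InterruptedException ie) {
                        // Preserve the interrupt flag and give up early.
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while retrying", ie);
                    }
                }
            }
        }
        // All attempts failed; surface the last failure to the caller.
        throw lastFailure;
    }
}
```

With something along these lines, step 5 above would turn into a retry against
the newly announced address rather than a test failure. It does not help if the
job itself is lost across the leadership change, so it only covers part of the
hardening.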
[~rmetzger] I am a bit surprised to see these ZooKeeper timeouts occur much
more often on our new AZP CI infrastructure than on Travis, where we never saw
them. I fear that many ZooKeeper tests actually assume that the ZooKeeper
connection is stable and might now be susceptible to timeout issues. Can we
figure out whether this has something to do with the underlying infrastructure?
The logs show what look like time jumps, as if the process only gets very
little processing time. If there is nothing we can do about it, then we need to
configure our ZooKeeper test servers to have a larger connection timeout.
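As a rough illustration of the timeout knobs, the sketch below collects the two
client-side Flink options that control the ZooKeeper session and connection
timeouts into a plain map. The option keys are the ones defined in Flink's
HighAvailabilityOptions. The class name ZooKeeperTestTimeouts and the method
relaxedTimeouts are hypothetical, the caller picks the values, and whether the
client side or the test server itself is the right place to raise the timeout
is an open question here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Flink: builds relaxed ZooKeeper timeout settings for tests. */
public final class ZooKeeperTestTimeouts {

    // Flink option keys (see HighAvailabilityOptions); values are in milliseconds.
    static final String SESSION_TIMEOUT_KEY =
            "high-availability.zookeeper.client.session-timeout";
    static final String CONNECTION_TIMEOUT_KEY =
            "high-availability.zookeeper.client.connection-timeout";

    private ZooKeeperTestTimeouts() {}

    /** Returns the two timeout entries, ready to be copied into a test configuration. */
    public static Map<String, String> relaxedTimeouts(
            long sessionTimeoutMillis, long connectionTimeoutMillis) {
        if (sessionTimeoutMillis <= 0 || connectionTimeoutMillis <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        settings.put(SESSION_TIMEOUT_KEY, Long.toString(sessionTimeoutMillis));
        settings.put(CONNECTION_TIMEOUT_KEY, Long.toString(connectionTimeoutMillis));
        return settings;
    }
}
```

Note that a ZooKeeper server bounds the session timeout it grants based on its
own tick time, so raising only the client-side value may not be sufficient.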
> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure failed
> because of Could not find Flink job
> -----------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-15661
> URL: https://issues.apache.org/jira/browse/FLINK-15661
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.11.0
> Reporter: Congxian Qiu(klion26)
> Priority: Critical
> Labels: test-stability
>
> 2020-01-19T06:25:02.3856954Z [ERROR]
> JobManagerHAProcessFailureRecoveryITCase.testDispatcherProcessFailure:347 The
> program encountered a ExecutionException :
> org.apache.flink.runtime.rest.util.RestClientException:
> [org.apache.flink.runtime.rest.handler.RestHandlerException:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (47fe3e8df0e59994938485f683d1410e)
> 2020-01-19T06:25:02.3857171Z at
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler.propagateException(JobExecutionResultHandler.java:91)
> 2020-01-19T06:25:02.3857571Z at
> org.apache.flink.runtime.rest.handler.job.JobExecutionResultHandler.lambda$handleRequest$1(JobExecutionResultHandler.java:82)
> 2020-01-19T06:25:02.3857866Z at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> 2020-01-19T06:25:02.3857982Z at
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
> 2020-01-19T06:25:02.3859852Z at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> 2020-01-19T06:25:02.3860440Z at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> 2020-01-19T06:25:02.3860732Z at
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:872)
> 2020-01-19T06:25:02.3860960Z at
> akka.dispatch.OnComplete.internal(Future.scala:263)
> 2020-01-19T06:25:02.3861099Z at
> akka.dispatch.OnComplete.internal(Future.scala:261)
> 2020-01-19T06:25:02.3861232Z at
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:191)
> 2020-01-19T06:25:02.3861391Z at
> akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)
> 2020-01-19T06:25:02.3861546Z at
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2020-01-19T06:25:02.3861712Z at
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:74)
> 2020-01-19T06:25:02.3861809Z at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> 2020-01-19T06:25:02.3861916Z at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> 2020-01-19T06:25:02.3862221Z at
> akka.pattern.PromiseActorRef.$bang(AskSupport.scala:572)
> 2020-01-19T06:25:02.3862475Z at
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:23)
> 2020-01-19T06:25:02.3862626Z at
> akka.pattern.PipeToSupport$PipeableFuture$$anonfun$pipeTo$1.applyOrElse(PipeToSupport.scala:21)
> 2020-01-19T06:25:02.3862736Z at
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:436)
> 2020-01-19T06:25:02.3862820Z at
> scala.concurrent.Future$$anonfun$andThen$1.apply(Future.scala:435)
> 2020-01-19T06:25:02.3867146Z at
> scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> 2020-01-19T06:25:02.3867318Z at
> akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
> 2020-01-19T06:25:02.3867441Z at
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
> 2020-01-19T06:25:02.3867552Z at
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2020-01-19T06:25:02.3867664Z at
> akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
> 2020-01-19T06:25:02.3867763Z at
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
> 2020-01-19T06:25:02.3867843Z at
> akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
> 2020-01-19T06:25:02.3867936Z at
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
> 2020-01-19T06:25:02.3868036Z at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
> 2020-01-19T06:25:02.3868145Z at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-01-19T06:25:02.3868223Z at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-01-19T06:25:02.3868313Z at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-01-19T06:25:02.3868390Z at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-01-19T06:25:02.3868520Z Caused by:
> java.util.concurrent.CompletionException:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (47fe3e8df0e59994938485f683d1410e)
> 2020-01-19T06:25:02.3868625Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJobStatus$17(Dispatcher.java:516)
> 2020-01-19T06:25:02.3868734Z at
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
> 2020-01-19T06:25:02.3868831Z at
> java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:884)
> 2020-01-19T06:25:02.3869143Z at
> java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2196)
> 2020-01-19T06:25:02.3869241Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:510)
> 2020-01-19T06:25:02.3869319Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-01-19T06:25:02.3869418Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-01-19T06:25:02.3869506Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-01-19T06:25:02.3869602Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-01-19T06:25:02.3869681Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:279)
> 2020-01-19T06:25:02.3869780Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:194)
> 2020-01-19T06:25:02.3869865Z at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-01-19T06:25:02.3869982Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> 2020-01-19T06:25:02.3870062Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-01-19T06:25:02.3870153Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-01-19T06:25:02.3870228Z at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 2020-01-19T06:25:02.3870399Z at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-01-19T06:25:02.3870481Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> 2020-01-19T06:25:02.3870571Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-01-19T06:25:02.3870646Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-01-19T06:25:02.3870733Z at
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> 2020-01-19T06:25:02.3870911Z at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-01-19T06:25:02.3871013Z at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-01-19T06:25:02.3871086Z at
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-01-19T06:25:02.3871170Z at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-01-19T06:25:02.3871350Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-01-19T06:25:02.3871439Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-01-19T06:25:02.3871509Z ... 4 more
> 2020-01-19T06:25:02.3871618Z Caused by:
> org.apache.flink.runtime.messages.FlinkJobNotFoundException: Could not find
> Flink job (47fe3e8df0e59994938485f683d1410e)
> 2020-01-19T06:25:02.3871721Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.getJobMasterGatewayFuture(Dispatcher.java:776)
> 2020-01-19T06:25:02.3871827Z at
> org.apache.flink.runtime.dispatcher.Dispatcher.requestJobStatus(Dispatcher.java:505)
> 2020-01-19T06:25:02.3871903Z ... 26 more
> 2020-01-19T06:25:02.3871975Z ]
>
> [https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_apis/build/builds/4461/logs/15]