[
https://issues.apache.org/jira/browse/FLINK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224609#comment-17224609
]
Robert Metzger commented on FLINK-19805:
----------------------------------------
After an offline discussion we concluded that this error is caused by the
ExecutionAttemptId being reused across different leader sessions. The reported
error is the Task's failure to report its status to the JobManager. This gets
reported through the new leader session to the JobMaster, which cannot
distinguish whether this failure is coming from the current or a previous
execution attempt.
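To illustrate the ambiguity described above, here is a minimal, self-contained
sketch. It is not Flink code: LeaderSession, the string attempt ids and
staleFailureHitsNewSession are hypothetical stand-ins, and the real bookkeeping
in the JobMaster is considerably more involved. The sketch only assumes that
executions are looked up by attempt id alone.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model (not Flink code) of a stale failure report meeting a reused attempt id. */
public class AttemptIdReuseSketch {

    /** Stand-in for one JobMaster leader session that tracks executions by attempt id only. */
    static final class LeaderSession {
        private final Map<String, String> stateByAttemptId = new HashMap<>();

        void deploy(String attemptId) {
            stateByAttemptId.put(attemptId, "RUNNING");
        }

        /** Applies a failure report; returns true if a tracked RUNNING execution was failed. */
        boolean reportFailure(String attemptId) {
            return stateByAttemptId.replace(attemptId, "RUNNING", "FAILED");
        }
    }

    /**
     * Simulates a leader change: the old session deploys oldAttemptId, the new session
     * deploys newAttemptId, then a late failure report for oldAttemptId reaches the new session.
     */
    static boolean staleFailureHitsNewSession(String oldAttemptId, String newAttemptId) {
        LeaderSession oldSession = new LeaderSession();
        oldSession.deploy(oldAttemptId);

        // Leadership changes: the new session redeploys the same vertex.
        LeaderSession newSession = new LeaderSession();
        newSession.deploy(newAttemptId);

        // The old task's failure report arrives late and is routed to the new session.
        return newSession.reportFailure(oldAttemptId);
    }

    public static void main(String[] args) {
        // Same id in both sessions: the stale report fails the new execution.
        System.out.println("reused id: " + staleFailureHitsNewSession("vertex_3_0", "vertex_3_0"));
        // Distinct ids: the stale report matches nothing and can be ignored.
        System.out.println("fresh id:  " + staleFailureHitsNewSession("vertex_3_0", "vertex_3_1"));
    }
}
```

With a reused id the late report is applied to the new execution; with a fresh
id per session it matches nothing, so it could be dropped safely.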
This problem was introduced by FLINK-17295. I will now try to revert
FLINK-17295 to see if that makes the test stable.
I will also introduce an assertion into the ExecutionGraph that the number of
deployments tracked by the DefaultExecutionDeploymentTracker is always 0 after
suspending the execution. This might help uncover further problems.
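As a rough idea of what such an assertion could look like, here is a
self-contained sketch. SketchDeploymentTracker and SketchExecutionGraph are
hypothetical stand-ins, not the real DefaultExecutionDeploymentTracker or
ExecutionGraph, whose interfaces are not shown in this thread; the method names
are assumptions made for illustration only.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch (not Flink code) of a "no tracked deployments after suspend" check. */
public class SuspendAssertionSketch {

    /** Stand-in tracker that remembers which attempt ids are currently deployed. */
    static final class SketchDeploymentTracker {
        private final Set<String> tracked = new HashSet<>();

        void startTracking(String attemptId) {
            tracked.add(attemptId);
        }

        void stopTracking(String attemptId) {
            tracked.remove(attemptId);
        }

        int trackedCount() {
            return tracked.size();
        }
    }

    /** Stand-in graph that registers deployments and verifies the tracker on suspend. */
    static final class SketchExecutionGraph {
        private final SketchDeploymentTracker tracker;
        private final Set<String> deployed = new HashSet<>();

        SketchExecutionGraph(SketchDeploymentTracker tracker) {
            this.tracker = tracker;
        }

        void deploy(String attemptId) {
            deployed.add(attemptId);
            tracker.startTracking(attemptId);
        }

        void suspend() {
            // Stop tracking everything this graph knows it deployed.
            for (String attemptId : deployed) {
                tracker.stopTracking(attemptId);
            }
            deployed.clear();

            // The assertion: nothing may remain tracked once the graph is suspended.
            int remaining = tracker.trackedCount();
            if (remaining != 0) {
                throw new IllegalStateException(
                        remaining + " deployment(s) still tracked after suspend");
            }
        }
    }

    public static void main(String[] args) {
        SketchDeploymentTracker tracker = new SketchDeploymentTracker();
        SketchExecutionGraph graph = new SketchExecutionGraph(tracker);
        graph.deploy("vertex_0_0");
        graph.deploy("vertex_1_0");
        graph.suspend();
        System.out.println("tracked after suspend: " + tracker.trackedCount());
    }
}
```

Failing fast at suspension would surface a leaked deployment at the point where
it is left behind, rather than later through a heartbeat reconciliation.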
> LeaderChangeClusterComponentsTest.testReelectionOfJobMaster is instable
> -----------------------------------------------------------------------
>
> Key: FLINK-19805
> URL: https://issues.apache.org/jira/browse/FLINK-19805
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Assignee: Robert Metzger
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.12.0
>
> Attachments: mvn-2.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8214&view=logs&j=3b6ec2fd-a816-5e75-c775-06fb87cb6670&t=2aff8966-346f-518f-e6ce-de64002a5034
> {code}
> 2020-10-23T21:07:32.6861747Z [ERROR]
> testReelectionOfJobMaster(org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest)
> Time elapsed: 30.182 s <<< FAILURE!
> 2020-10-23T21:07:32.6862546Z java.lang.AssertionError: Job failed.
> 2020-10-23T21:07:32.6865424Z at
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.throwAssertionErrorOnFailedResult(JobResultUtils.java:54)
> 2020-10-23T21:07:32.6866512Z at
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.assertSuccess(JobResultUtils.java:30)
> 2020-10-23T21:07:32.6867720Z at
> org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest.testReelectionOfJobMaster(LeaderChangeClusterComponentsTest.java:152)
> 2020-10-23T21:07:32.6868707Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6869428Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6870293Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6871062Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6871954Z at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-10-23T21:07:32.6872726Z at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-10-23T21:07:32.6873503Z at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-10-23T21:07:32.6874393Z at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-10-23T21:07:32.6875218Z at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6876001Z at
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-10-23T21:07:32.6876816Z at
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-10-23T21:07:32.6877475Z at
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-10-23T21:07:32.6878216Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-10-23T21:07:32.6879061Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-10-23T21:07:32.6879819Z at
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-10-23T21:07:32.6880502Z at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-10-23T21:07:32.6881215Z at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-10-23T21:07:32.6882109Z at
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-10-23T21:07:32.6882850Z at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-10-23T21:07:32.6884171Z at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6884969Z at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-10-23T21:07:32.6885641Z at
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-10-23T21:07:32.6886201Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-10-23T21:07:32.6886841Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-10-23T21:07:32.6887378Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-10-23T21:07:32.6887913Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-10-23T21:07:32.6888478Z at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-10-23T21:07:32.6889109Z at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-10-23T21:07:32.6889625Z at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-10-23T21:07:32.6890110Z at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-10-23T21:07:32.6890607Z Caused by:
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
> 2020-10-23T21:07:32.6891237Z at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> 2020-10-23T21:07:32.6892166Z at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> 2020-10-23T21:07:32.6892827Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:217)
> 2020-10-23T21:07:32.6893382Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:210)
> 2020-10-23T21:07:32.6894048Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:204)
> 2020-10-23T21:07:32.6894667Z at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:526)
> 2020-10-23T21:07:32.6895205Z at
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:240)
> 2020-10-23T21:07:32.6895872Z at
> org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentReconciler.reconcileExecutionDeployments(DefaultExecutionDeploymentReconciler.java:55)
> 2020-10-23T21:07:32.6896633Z at
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1234)
> 2020-10-23T21:07:32.6897239Z at
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1221)
> 2020-10-23T21:07:32.6897834Z at
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.receiveHeartbeat(HeartbeatManagerImpl.java:199)
> 2020-10-23T21:07:32.6898395Z at
> org.apache.flink.runtime.jobmaster.JobMaster.heartbeatFromTaskManager(JobMaster.java:672)
> 2020-10-23T21:07:32.6898846Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6899289Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6899816Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6900257Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6900723Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:281)
> 2020-10-23T21:07:32.6901265Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201)
> 2020-10-23T21:07:32.6901938Z at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-10-23T21:07:32.6902497Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
> 2020-10-23T21:07:32.6902977Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-10-23T21:07:32.6903678Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6904110Z at
> scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> 2020-10-23T21:07:32.6904545Z at
> scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> 2020-10-23T21:07:32.6904974Z at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6905435Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-10-23T21:07:32.6905893Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906325Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906830Z at
> akka.actor.Actor.aroundReceive(Actor.scala:517)
> 2020-10-23T21:07:32.6907220Z at
> akka.actor.Actor.aroundReceive$(Actor.scala:515)
> 2020-10-23T21:07:32.6907618Z at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-10-23T21:07:32.6908050Z at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-10-23T21:07:32.6908458Z at
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-10-23T21:07:32.6908831Z at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-10-23T21:07:32.6909213Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-10-23T21:07:32.6909554Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-10-23T21:07:32.6909959Z at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-10-23T21:07:32.6910436Z at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-10-23T21:07:32.6910887Z at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-10-23T21:07:32.6911360Z at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-10-23T21:07:32.6913020Z Caused by: org.apache.flink.util.FlinkException:
> Execution
> ff7439fa86b9f67e46b2b6715829af00_dccf9918c07aa47eb2b28a1de42a640f_3_0 is
> unexpectedly no longer running on task executor
> 3d5d979c-6898-4593-935e-f0914738d325.
> 2020-10-23T21:07:32.6913798Z at
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:244)
> 2020-10-23T21:07:32.6914167Z ... 33 more
> {code}