[ 
https://issues.apache.org/jira/browse/FLINK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224609#comment-17224609
 ] 

Robert Metzger commented on FLINK-19805:
----------------------------------------

After an offline discussion we concluded that this error is caused by the 
ExecutionAttemptId being reused across different leader sessions. The reported 
error is the Task's failure to report it's status to the JobManager. This gets 
reported through the new leader session to the JobMaster, which can not 
distinguish if this failure is coming from the current of previous execution 
attempt.

This problem has been introduced by FLINK-17295. I will now try to revert 
FLINK-17295 to see if that makes the test stable.

I will also introduce an assertion into the ExecutionGraph that the number of 
tracked deployments by the DefaultExecutionDeploymentTracker is always 0 after 
suspending the execution. This might help uncover further problems.

> LeaderChangeClusterComponentsTest.testReelectionOfJobMaster is instable
> -----------------------------------------------------------------------
>
>                 Key: FLINK-19805
>                 URL: https://issues.apache.org/jira/browse/FLINK-19805
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Dian Fu
>            Assignee: Robert Metzger
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.12.0
>
>         Attachments: mvn-2.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8214&view=logs&j=3b6ec2fd-a816-5e75-c775-06fb87cb6670&t=2aff8966-346f-518f-e6ce-de64002a5034
> {code}
> 2020-10-23T21:07:32.6861747Z [ERROR] 
> testReelectionOfJobMaster(org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest)
>   Time elapsed: 30.182 s  <<< FAILURE!
> 2020-10-23T21:07:32.6862546Z java.lang.AssertionError: Job failed.
> 2020-10-23T21:07:32.6865424Z  at 
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.throwAssertionErrorOnFailedResult(JobResultUtils.java:54)
> 2020-10-23T21:07:32.6866512Z  at 
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.assertSuccess(JobResultUtils.java:30)
> 2020-10-23T21:07:32.6867720Z  at 
> org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest.testReelectionOfJobMaster(LeaderChangeClusterComponentsTest.java:152)
> 2020-10-23T21:07:32.6868707Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6869428Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6870293Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6871062Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6871954Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-10-23T21:07:32.6872726Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-10-23T21:07:32.6873503Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-10-23T21:07:32.6874393Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-10-23T21:07:32.6875218Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6876001Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-10-23T21:07:32.6876816Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-10-23T21:07:32.6877475Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-10-23T21:07:32.6878216Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-10-23T21:07:32.6879061Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-10-23T21:07:32.6879819Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-10-23T21:07:32.6880502Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-10-23T21:07:32.6881215Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-10-23T21:07:32.6882109Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-10-23T21:07:32.6882850Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-10-23T21:07:32.6884171Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6884969Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-10-23T21:07:32.6885641Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-10-23T21:07:32.6886201Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-10-23T21:07:32.6886841Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-10-23T21:07:32.6887378Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-10-23T21:07:32.6887913Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-10-23T21:07:32.6888478Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-10-23T21:07:32.6889109Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-10-23T21:07:32.6889625Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-10-23T21:07:32.6890110Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-10-23T21:07:32.6890607Z Caused by: 
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
> 2020-10-23T21:07:32.6891237Z  at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> 2020-10-23T21:07:32.6892166Z  at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> 2020-10-23T21:07:32.6892827Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:217)
> 2020-10-23T21:07:32.6893382Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:210)
> 2020-10-23T21:07:32.6894048Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:204)
> 2020-10-23T21:07:32.6894667Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:526)
> 2020-10-23T21:07:32.6895205Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:240)
> 2020-10-23T21:07:32.6895872Z  at 
> org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentReconciler.reconcileExecutionDeployments(DefaultExecutionDeploymentReconciler.java:55)
> 2020-10-23T21:07:32.6896633Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1234)
> 2020-10-23T21:07:32.6897239Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1221)
> 2020-10-23T21:07:32.6897834Z  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.receiveHeartbeat(HeartbeatManagerImpl.java:199)
> 2020-10-23T21:07:32.6898395Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster.heartbeatFromTaskManager(JobMaster.java:672)
> 2020-10-23T21:07:32.6898846Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6899289Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6899816Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6900257Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6900723Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:281)
> 2020-10-23T21:07:32.6901265Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201)
> 2020-10-23T21:07:32.6901938Z  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-10-23T21:07:32.6902497Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
> 2020-10-23T21:07:32.6902977Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-10-23T21:07:32.6903678Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6904110Z  at 
> scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> 2020-10-23T21:07:32.6904545Z  at 
> scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> 2020-10-23T21:07:32.6904974Z  at 
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6905435Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-10-23T21:07:32.6905893Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906325Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906830Z  at 
> akka.actor.Actor.aroundReceive(Actor.scala:517)
> 2020-10-23T21:07:32.6907220Z  at 
> akka.actor.Actor.aroundReceive$(Actor.scala:515)
> 2020-10-23T21:07:32.6907618Z  at 
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-10-23T21:07:32.6908050Z  at 
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-10-23T21:07:32.6908458Z  at 
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-10-23T21:07:32.6908831Z  at 
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-10-23T21:07:32.6909213Z  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-10-23T21:07:32.6909554Z  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-10-23T21:07:32.6909959Z  at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-10-23T21:07:32.6910436Z  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-10-23T21:07:32.6910887Z  at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-10-23T21:07:32.6911360Z  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-10-23T21:07:32.6913020Z Caused by: org.apache.flink.util.FlinkException: 
> Execution 
> ff7439fa86b9f67e46b2b6715829af00_dccf9918c07aa47eb2b28a1de42a640f_3_0 is 
> unexpectedly no longer running on task executor 
> 3d5d979c-6898-4593-935e-f0914738d325.
> 2020-10-23T21:07:32.6913798Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:244)
> 2020-10-23T21:07:32.6914167Z  ... 33 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to