[
https://issues.apache.org/jira/browse/FLINK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224476#comment-17224476
]
Robert Metzger commented on FLINK-19805:
----------------------------------------
What is the problem?
In rare cases, the test fails, because the DefaultExecutionDeploymentReconciler
expects certain executions to be present on the TaskExecutor, while they are
not.
The problem is likely more on the JobManager side, tracking the executions
wrongly.
Current findings:
a) For reproducing the issue locally, I noticed that no heartbeats are send
from the TaskManager to the JobManager. I initially thought the default
heartbeat frequency is too low, but it rather seems that in
{{HeartbeatManagerSenderImpl}} the map of targets is always empty (even at an
interval of 100 ms).
b) I introduced a log statement into {{SchedulerBase.suspend()}} to log the
number of tracked deployments by the ExecutionDeploymentTracker after
suspension, and the number is 0 when the test succeeds and 1 when it fails.
However, in both cases {{stopTrackingDeploymentOf}} is never called. Maybe a
new instance is introduced somewhere? I'm currently validating this with a new
test run.
Also note that in successful cases, the number of tracked deployments can be 1.
I'm investigating if this is a condition of the failure.
c) In successful cases, heartbeats are only triggered by the JobMaster, but
never responded by any TaskExecutor (probably because they are not registered).
The problem can only surface if heartbeats are received by the JobMaster.
d) Even if a heartbeat is received by the JobMaster, and the reconciliation
detects a mismatch, the job does not always fail: If the job finishes
"concurrently", the reconciler exception is not reported:
{code}
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Received
heartbeat request from 2c0b814278e98801af7cf7a663d486aa.
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG
org.apache.flink.runtime.jobmaster.JobMaster [] - Received
heartbeat from 69b562cc-9497-4ea9-9820-1518c8a6da34.
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG
org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentTracker [] -
executionsByHost =
{69b562cc-9497-4ea9-9820-1518c8a6da34=[25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_0_0,
25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_1_0]}
19:40:02,257 [flink-akka.actor.default-dispatcher-5] INFO
org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentReconciler [] -
Reconciling with taskExecutorHost=69b562cc-9497-4ea9-9820-1518c8a6da34,
executionDeploymentReport=ExecutionDeploymentReport{executions=[]},
expectedDeployedExecutions={25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_0_0=DEPLOYED,
25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_1_0=DEPLOYED}
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG
org.apache.flink.runtime.jobmaster.JobMaster [] - Failing
deployments
[25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_0_0,
25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_1_0] due to
no longer being deployed.
19:40:02,258 [flink-akka.actor.default-dispatcher-5] INFO
org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
25c51994b6e81cdc39017bd42618bb86 reached globally terminal state FINISHED.
{code}
> LeaderChangeClusterComponentsTest.testReelectionOfJobMaster is instable
> -----------------------------------------------------------------------
>
> Key: FLINK-19805
> URL: https://issues.apache.org/jira/browse/FLINK-19805
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Assignee: Robert Metzger
> Priority: Blocker
> Labels: test-stability
> Fix For: 1.12.0
>
> Attachments: mvn-2.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8214&view=logs&j=3b6ec2fd-a816-5e75-c775-06fb87cb6670&t=2aff8966-346f-518f-e6ce-de64002a5034
> {code}
> 2020-10-23T21:07:32.6861747Z [ERROR]
> testReelectionOfJobMaster(org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest)
> Time elapsed: 30.182 s <<< FAILURE!
> 2020-10-23T21:07:32.6862546Z java.lang.AssertionError: Job failed.
> 2020-10-23T21:07:32.6865424Z at
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.throwAssertionErrorOnFailedResult(JobResultUtils.java:54)
> 2020-10-23T21:07:32.6866512Z at
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.assertSuccess(JobResultUtils.java:30)
> 2020-10-23T21:07:32.6867720Z at
> org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest.testReelectionOfJobMaster(LeaderChangeClusterComponentsTest.java:152)
> 2020-10-23T21:07:32.6868707Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6869428Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6870293Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6871062Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6871954Z at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-10-23T21:07:32.6872726Z at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-10-23T21:07:32.6873503Z at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-10-23T21:07:32.6874393Z at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-10-23T21:07:32.6875218Z at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6876001Z at
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-10-23T21:07:32.6876816Z at
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-10-23T21:07:32.6877475Z at
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-10-23T21:07:32.6878216Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-10-23T21:07:32.6879061Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-10-23T21:07:32.6879819Z at
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-10-23T21:07:32.6880502Z at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-10-23T21:07:32.6881215Z at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-10-23T21:07:32.6882109Z at
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-10-23T21:07:32.6882850Z at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-10-23T21:07:32.6884171Z at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6884969Z at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-10-23T21:07:32.6885641Z at
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-10-23T21:07:32.6886201Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-10-23T21:07:32.6886841Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-10-23T21:07:32.6887378Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-10-23T21:07:32.6887913Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-10-23T21:07:32.6888478Z at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-10-23T21:07:32.6889109Z at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-10-23T21:07:32.6889625Z at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-10-23T21:07:32.6890110Z at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-10-23T21:07:32.6890607Z Caused by:
> org.apache.flink.runtime.JobException: Recovery is suppressed by
> NoRestartBackoffTimeStrategy
> 2020-10-23T21:07:32.6891237Z at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> 2020-10-23T21:07:32.6892166Z at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> 2020-10-23T21:07:32.6892827Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:217)
> 2020-10-23T21:07:32.6893382Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:210)
> 2020-10-23T21:07:32.6894048Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:204)
> 2020-10-23T21:07:32.6894667Z at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:526)
> 2020-10-23T21:07:32.6895205Z at
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:240)
> 2020-10-23T21:07:32.6895872Z at
> org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentReconciler.reconcileExecutionDeployments(DefaultExecutionDeploymentReconciler.java:55)
> 2020-10-23T21:07:32.6896633Z at
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1234)
> 2020-10-23T21:07:32.6897239Z at
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1221)
> 2020-10-23T21:07:32.6897834Z at
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.receiveHeartbeat(HeartbeatManagerImpl.java:199)
> 2020-10-23T21:07:32.6898395Z at
> org.apache.flink.runtime.jobmaster.JobMaster.heartbeatFromTaskManager(JobMaster.java:672)
> 2020-10-23T21:07:32.6898846Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6899289Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6899816Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6900257Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6900723Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:281)
> 2020-10-23T21:07:32.6901265Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201)
> 2020-10-23T21:07:32.6901938Z at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-10-23T21:07:32.6902497Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
> 2020-10-23T21:07:32.6902977Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-10-23T21:07:32.6903678Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6904110Z at
> scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> 2020-10-23T21:07:32.6904545Z at
> scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> 2020-10-23T21:07:32.6904974Z at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6905435Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-10-23T21:07:32.6905893Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906325Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906830Z at
> akka.actor.Actor.aroundReceive(Actor.scala:517)
> 2020-10-23T21:07:32.6907220Z at
> akka.actor.Actor.aroundReceive$(Actor.scala:515)
> 2020-10-23T21:07:32.6907618Z at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-10-23T21:07:32.6908050Z at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-10-23T21:07:32.6908458Z at
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-10-23T21:07:32.6908831Z at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-10-23T21:07:32.6909213Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-10-23T21:07:32.6909554Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-10-23T21:07:32.6909959Z at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-10-23T21:07:32.6910436Z at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-10-23T21:07:32.6910887Z at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-10-23T21:07:32.6911360Z at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-10-23T21:07:32.6913020Z Caused by: org.apache.flink.util.FlinkException:
> Execution
> ff7439fa86b9f67e46b2b6715829af00_dccf9918c07aa47eb2b28a1de42a640f_3_0 is
> unexpectedly no longer running on task executor
> 3d5d979c-6898-4593-935e-f0914738d325.
> 2020-10-23T21:07:32.6913798Z at
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:244)
> 2020-10-23T21:07:32.6914167Z ... 33 more
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)