[ 
https://issues.apache.org/jira/browse/FLINK-19805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224476#comment-17224476
 ] 

Robert Metzger commented on FLINK-19805:
----------------------------------------

What is the problem?
In rare cases, the test fails, because the DefaultExecutionDeploymentReconciler 
expects certain executions to be present on the TaskExecutor, while they are 
not.
The problem is likely more on the JobManager side, tracking the executions 
wrongly.

Current findings:
a) For reproducing the issue locally, I noticed that no heartbeats are send 
from the TaskManager to the JobManager. I initially thought the default 
heartbeat frequency is too low, but it rather seems that in 
{{HeartbeatManagerSenderImpl}} the map of targets is always empty (even at an 
interval of 100 ms).

b) I introduced a log statement into {{SchedulerBase.suspend()}} to log the 
number of tracked deployments by the ExecutionDeploymentTracker after 
suspension, and the number is 0 when the test succeeds and 1 when it fails. 
However, in both cases {{stopTrackingDeploymentOf}} is never called. Maybe a 
new instance is introduced somewhere? I'm currently validating this with a new 
test run.
Also note that in successful cases, the number of tracked deployments can be 1. 
I'm investigating if this is a condition of the failure.

c) In successful cases, heartbeats are only triggered by the JobMaster, but 
never responded by any TaskExecutor (probably because they are not registered). 
The problem can only surface if heartbeats are received by the JobMaster.

d) Even if a heartbeat is received by the JobMaster, and the reconciliation 
detects a mismatch, the job does not always fail: If the job finishes 
"concurrently", the reconciler exception is not reported:
{code}
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG 
org.apache.flink.runtime.taskexecutor.TaskExecutor           [] - Received 
heartbeat request from 2c0b814278e98801af7cf7a663d486aa.
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG 
org.apache.flink.runtime.jobmaster.JobMaster                 [] - Received 
heartbeat from 69b562cc-9497-4ea9-9820-1518c8a6da34.
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG 
org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentTracker [] - 
executionsByHost = 
{69b562cc-9497-4ea9-9820-1518c8a6da34=[25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_0_0,
 25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_1_0]}
19:40:02,257 [flink-akka.actor.default-dispatcher-5] INFO  
org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentReconciler [] - 
Reconciling with taskExecutorHost=69b562cc-9497-4ea9-9820-1518c8a6da34, 
executionDeploymentReport=ExecutionDeploymentReport{executions=[]}, 
expectedDeployedExecutions={25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_0_0=DEPLOYED,
 25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_1_0=DEPLOYED}
19:40:02,257 [flink-akka.actor.default-dispatcher-5] DEBUG 
org.apache.flink.runtime.jobmaster.JobMaster                 [] - Failing 
deployments 
[25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_0_0, 
25c51994b6e81cdc39017bd42618bb86_390ab88bdf3f62c745ec70954a480407_1_0] due to 
no longer being deployed.
19:40:02,258 [flink-akka.actor.default-dispatcher-5] INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher     [] - Job 
25c51994b6e81cdc39017bd42618bb86 reached globally terminal state FINISHED.
{code}



> LeaderChangeClusterComponentsTest.testReelectionOfJobMaster is instable
> -----------------------------------------------------------------------
>
>                 Key: FLINK-19805
>                 URL: https://issues.apache.org/jira/browse/FLINK-19805
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Dian Fu
>            Assignee: Robert Metzger
>            Priority: Blocker
>              Labels: test-stability
>             Fix For: 1.12.0
>
>         Attachments: mvn-2.log
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=8214&view=logs&j=3b6ec2fd-a816-5e75-c775-06fb87cb6670&t=2aff8966-346f-518f-e6ce-de64002a5034
> {code}
> 2020-10-23T21:07:32.6861747Z [ERROR] 
> testReelectionOfJobMaster(org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest)
>   Time elapsed: 30.182 s  <<< FAILURE!
> 2020-10-23T21:07:32.6862546Z java.lang.AssertionError: Job failed.
> 2020-10-23T21:07:32.6865424Z  at 
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.throwAssertionErrorOnFailedResult(JobResultUtils.java:54)
> 2020-10-23T21:07:32.6866512Z  at 
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.assertSuccess(JobResultUtils.java:30)
> 2020-10-23T21:07:32.6867720Z  at 
> org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest.testReelectionOfJobMaster(LeaderChangeClusterComponentsTest.java:152)
> 2020-10-23T21:07:32.6868707Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6869428Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6870293Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6871062Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6871954Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-10-23T21:07:32.6872726Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-10-23T21:07:32.6873503Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-10-23T21:07:32.6874393Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-10-23T21:07:32.6875218Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6876001Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-10-23T21:07:32.6876816Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-10-23T21:07:32.6877475Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-10-23T21:07:32.6878216Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-10-23T21:07:32.6879061Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-10-23T21:07:32.6879819Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-10-23T21:07:32.6880502Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-10-23T21:07:32.6881215Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-10-23T21:07:32.6882109Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-10-23T21:07:32.6882850Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-10-23T21:07:32.6884171Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-10-23T21:07:32.6884969Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-10-23T21:07:32.6885641Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-10-23T21:07:32.6886201Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-10-23T21:07:32.6886841Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-10-23T21:07:32.6887378Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-10-23T21:07:32.6887913Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-10-23T21:07:32.6888478Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-10-23T21:07:32.6889109Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-10-23T21:07:32.6889625Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-10-23T21:07:32.6890110Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-10-23T21:07:32.6890607Z Caused by: 
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
> 2020-10-23T21:07:32.6891237Z  at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> 2020-10-23T21:07:32.6892166Z  at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> 2020-10-23T21:07:32.6892827Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:217)
> 2020-10-23T21:07:32.6893382Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:210)
> 2020-10-23T21:07:32.6894048Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:204)
> 2020-10-23T21:07:32.6894667Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:526)
> 2020-10-23T21:07:32.6895205Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:240)
> 2020-10-23T21:07:32.6895872Z  at 
> org.apache.flink.runtime.jobmaster.DefaultExecutionDeploymentReconciler.reconcileExecutionDeployments(DefaultExecutionDeploymentReconciler.java:55)
> 2020-10-23T21:07:32.6896633Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1234)
> 2020-10-23T21:07:32.6897239Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.reportPayload(JobMaster.java:1221)
> 2020-10-23T21:07:32.6897834Z  at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl.receiveHeartbeat(HeartbeatManagerImpl.java:199)
> 2020-10-23T21:07:32.6898395Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster.heartbeatFromTaskManager(JobMaster.java:672)
> 2020-10-23T21:07:32.6898846Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T21:07:32.6899289Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T21:07:32.6899816Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T21:07:32.6900257Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T21:07:32.6900723Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:281)
> 2020-10-23T21:07:32.6901265Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:201)
> 2020-10-23T21:07:32.6901938Z  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-10-23T21:07:32.6902497Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:154)
> 2020-10-23T21:07:32.6902977Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-10-23T21:07:32.6903678Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6904110Z  at 
> scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> 2020-10-23T21:07:32.6904545Z  at 
> scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> 2020-10-23T21:07:32.6904974Z  at 
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-10-23T21:07:32.6905435Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-10-23T21:07:32.6905893Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906325Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> 2020-10-23T21:07:32.6906830Z  at 
> akka.actor.Actor.aroundReceive(Actor.scala:517)
> 2020-10-23T21:07:32.6907220Z  at 
> akka.actor.Actor.aroundReceive$(Actor.scala:515)
> 2020-10-23T21:07:32.6907618Z  at 
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-10-23T21:07:32.6908050Z  at 
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-10-23T21:07:32.6908458Z  at 
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-10-23T21:07:32.6908831Z  at 
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-10-23T21:07:32.6909213Z  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-10-23T21:07:32.6909554Z  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-10-23T21:07:32.6909959Z  at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-10-23T21:07:32.6910436Z  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-10-23T21:07:32.6910887Z  at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-10-23T21:07:32.6911360Z  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-10-23T21:07:32.6913020Z Caused by: org.apache.flink.util.FlinkException: 
> Execution 
> ff7439fa86b9f67e46b2b6715829af00_dccf9918c07aa47eb2b28a1de42a640f_3_0 is 
> unexpectedly no longer running on task executor 
> 3d5d979c-6898-4593-935e-f0914738d325.
> 2020-10-23T21:07:32.6913798Z  at 
> org.apache.flink.runtime.jobmaster.JobMaster$1.onMissingDeploymentsOf(JobMaster.java:244)
> 2020-10-23T21:07:32.6914167Z  ... 33 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to