[
https://issues.apache.org/jira/browse/FLINK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208592#comment-17208592
]
Robert Metzger commented on FLINK-19237:
----------------------------------------
In the normal (passing) case, the sequence seems to be:
a) The job gets submitted.
b) Operators get scheduled, but slot requests cannot be served yet (no RM connected).
c) RM registration starts but doesn't finish.
d) JM loses leadership; the job gets suspended.
e) JM regains leadership.
f) The job starts running and the test succeeds.
In the failure case, RM registration completes before the leadership change (step 3):
1. The job gets submitted.
2. Operators get scheduled, but slot requests cannot be served yet (no RM connected).
3. RM registration succeeds, slots get allocated and activated, and operators switch to DEPLOYING.
4. JM loses leadership; the job gets suspended.
5. JM regains leadership.
6. The TaskManager reports that the slots get rejected by the job manager.
7. The TaskExecutors close the JM connection (reason: no more allocated slots).
8. Slot allocation times out.
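
To make the difference between the two orderings easier to see, here is a minimal toy model of the sequences listed above. It is not Flink code and does not touch any Flink API: the class `LeaderChangeRaceSketch`, the enum `Outcome`, and the method `simulate` are hypothetical names made up for illustration, and the single boolean only encodes the observation that the outcome seems to depend on whether RM registration finishes before the JM loses leadership.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the two event orderings from the comment; NOT Flink code.
// All names (LeaderChangeRaceSketch, Outcome, simulate) are hypothetical.
class LeaderChangeRaceSketch {

    enum Outcome { JOB_RUNNING, SLOT_REQUEST_TIMEOUT }

    // Replays the observed sequence and records each step in `events`.
    static Outcome simulate(boolean rmRegisteredBeforeLeaderLoss, List<String> events) {
        events.add("job submitted");
        events.add("operators scheduled, slot requests pending (no RM connected)");

        // The only modeled difference between the two cases.
        boolean slotsAllocatedBeforeLeaderLoss = rmRegisteredBeforeLeaderLoss;
        if (slotsAllocatedBeforeLeaderLoss) {
            events.add("RM registration succeeded, slots allocated and activated, operators DEPLOYING");
        } else {
            events.add("RM registration started but not finished");
        }

        events.add("JM loses leadership, job suspended");
        events.add("JM regains leadership");

        if (slotsAllocatedBeforeLeaderLoss) {
            // Failure case: steps 6 to 8 of the comment.
            events.add("TaskManager reports: slots rejected by the job manager");
            events.add("TaskExecutors close JM connection: no more allocated slots");
            events.add("slot allocation times out");
            return Outcome.SLOT_REQUEST_TIMEOUT;
        }

        // Normal case: step f) of the comment.
        events.add("job starts running");
        return Outcome.JOB_RUNNING;
    }

    public static void main(String[] args) {
        for (boolean early : new boolean[] {false, true}) {
            List<String> events = new ArrayList<>();
            Outcome outcome = simulate(early, events);
            System.out.println("RM registered before leader loss = " + early + " -> " + outcome);
            for (String event : events) {
                System.out.println("  " + event);
            }
        }
    }
}
```

The sketch only restates the observed orderings; it does not explain why the slots are rejected after re-election, which still needs to be confirmed from the logs.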
> LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with
> "NoResourceAvailableException: Could not allocate the required slot within
> slot request timeout"
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-19237
> URL: https://issues.apache.org/jira/browse/FLINK-19237
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Dian Fu
> Assignee: Robert Metzger
> Priority: Critical
> Labels: test-stability
> Fix For: 1.12.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6499&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=fd9796c3-9ce8-5619-781c-42f873e126a6]
> {code}
> 2020-09-14T21:11:02.8200203Z [ERROR] testReelectionOfJobMaster(org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest) Time elapsed: 300.14 s <<< FAILURE!
> 2020-09-14T21:11:02.8201761Z java.lang.AssertionError: Job failed.
> 2020-09-14T21:11:02.8202749Z at
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.throwAssertionErrorOnFailedResult(JobResultUtils.java:54)
> 2020-09-14T21:11:02.8203794Z at
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.assertSuccess(JobResultUtils.java:30)
> 2020-09-14T21:11:02.8205177Z at
> org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest.testReelectionOfJobMaster(LeaderChangeClusterComponentsTest.java:152)
> 2020-09-14T21:11:02.8206191Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-09-14T21:11:02.8206985Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-09-14T21:11:02.8207930Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-09-14T21:11:02.8208927Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-09-14T21:11:02.8209753Z at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-09-14T21:11:02.8210710Z at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-09-14T21:11:02.8211608Z at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-09-14T21:11:02.8214473Z at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-09-14T21:11:02.8215398Z at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-14T21:11:02.8216199Z at
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-09-14T21:11:02.8216947Z at
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-09-14T21:11:02.8217695Z at
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-09-14T21:11:02.8218635Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-09-14T21:11:02.8219499Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-09-14T21:11:02.8220313Z at
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-14T21:11:02.8221060Z at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-14T21:11:02.8222171Z at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-14T21:11:02.8222937Z at
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-09-14T21:11:02.8223688Z at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-09-14T21:11:02.8225191Z at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-14T21:11:02.8226086Z at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-09-14T21:11:02.8226761Z at
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-09-14T21:11:02.8227453Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-09-14T21:11:02.8228392Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-09-14T21:11:02.8229256Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-09-14T21:11:02.8235798Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-09-14T21:11:02.8237650Z at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-09-14T21:11:02.8239039Z at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-09-14T21:11:02.8239894Z at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-09-14T21:11:02.8240591Z at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-09-14T21:11:02.8241325Z Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
> 2020-09-14T21:11:02.8242225Z at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> 2020-09-14T21:11:02.8243358Z at
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> 2020-09-14T21:11:02.8244425Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:215)
> 2020-09-14T21:11:02.8245291Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:208)
> 2020-09-14T21:11:02.8246150Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:202)
> 2020-09-14T21:11:02.8247006Z at
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:523)
> 2020-09-14T21:11:02.8247960Z at
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> 2020-09-14T21:11:02.8249102Z at
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1722)
> 2020-09-14T21:11:02.8249971Z at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1283)
> 2020-09-14T21:11:02.8250675Z at
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1251)
> 2020-09-14T21:11:02.8251369Z at
> org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1082)
> 2020-09-14T21:11:02.8252104Z at
> org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
> 2020-09-14T21:11:02.8253060Z at
> org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
> 2020-09-14T21:11:02.8253956Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:458)
> 2020-09-14T21:11:02.8254967Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:445)
> 2020-09-14T21:11:02.8393562Z at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-09-14T21:11:02.8394920Z at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-09-14T21:11:02.8396122Z at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8397194Z at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8398150Z at
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:169)
> 2020-09-14T21:11:02.8399234Z at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-09-14T21:11:02.8400048Z at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-09-14T21:11:02.8401048Z at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8402025Z at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8403171Z at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:731)
> 2020-09-14T21:11:02.8404708Z at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> 2020-09-14T21:11:02.8405751Z at
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> 2020-09-14T21:11:02.8406633Z at
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-09-14T21:11:02.8407378Z at
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-09-14T21:11:02.8408120Z at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8408948Z at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8409748Z at
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1168)
> 2020-09-14T21:11:02.8410511Z at
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-09-14T21:11:02.8411543Z at
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-09-14T21:11:02.8412553Z at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8413340Z at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8414204Z at
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1072)
> 2020-09-14T21:11:02.8415364Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> 2020-09-14T21:11:02.8416128Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> 2020-09-14T21:11:02.8417172Z at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-09-14T21:11:02.8417995Z at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> 2020-09-14T21:11:02.8418997Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-09-14T21:11:02.8419692Z at
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-09-14T21:11:02.8420336Z at
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 2020-09-14T21:11:02.8421055Z at
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-09-14T21:11:02.8421655Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> 2020-09-14T21:11:02.8422336Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-09-14T21:11:02.8423049Z at
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-09-14T21:11:02.8423681Z at
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> 2020-09-14T21:11:02.8424505Z at
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-09-14T21:11:02.8425209Z at
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-09-14T21:11:02.8425760Z at
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-09-14T21:11:02.8426376Z at
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-09-14T21:11:02.8427252Z at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-09-14T21:11:02.8427931Z at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-09-14T21:11:02.8428684Z at
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-09-14T21:11:02.8429375Z at
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-09-14T21:11:02.8430118Z at
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-09-14T21:11:02.8430853Z at
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-09-14T21:11:02.8431971Z Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate the required slot within slot request timeout. Please make sure that the cluster has enough resources.
> 2020-09-14T21:11:02.8433179Z at
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:464)
> 2020-09-14T21:11:02.8434082Z ... 45 more
> 2020-09-14T21:11:02.8434809Z Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
> 2020-09-14T21:11:02.8435611Z at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 2020-09-14T21:11:02.8436379Z at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 2020-09-14T21:11:02.8437159Z at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> 2020-09-14T21:11:02.8437976Z at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2020-09-14T21:11:02.8438658Z ... 25 more
> 2020-09-14T21:11:02.8439085Z Caused by: java.util.concurrent.TimeoutException
> 2020-09-14T21:11:02.8439476Z ... 23 more
> {code}