[ 
https://issues.apache.org/jira/browse/FLINK-19237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17208592#comment-17208592
 ] 

Robert Metzger commented on FLINK-19237:
----------------------------------------

In the normal case, it seems that:
a) the job gets submitted
b) operators get scheduled, but slot requests cannot be served yet (no RM 
connected),
c) RM registration starts, but doesn't finish
d) JM loses leadership, job gets suspended
e) JM regains leadership
f) job starts running, test succeeds.

In the failure case:
1. Job gets submitted
2. Operators get scheduled, but slot requests cannot be served yet (no RM 
connected)
3. RM registration succeeds, slots get allocated and activated, operators 
switch to DEPLOYING
4. JM loses leadership, job gets suspended
5. JM regains leadership
6. TaskManager reports: slots get rejected by the job manager
7. TaskExecutors close JM connection: no more allocated slots
8. Slot allocation times out. 


> LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with 
> "NoResourceAvailableException: Could not allocate the required slot within 
> slot request timeout"
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19237
>                 URL: https://issues.apache.org/jira/browse/FLINK-19237
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Dian Fu
>            Assignee: Robert Metzger
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.12.0
>
>
> [https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=6499&view=logs&j=6bfdaf55-0c08-5e3f-a2d2-2a0285fd41cf&t=fd9796c3-9ce8-5619-781c-42f873e126a6]
> {code}
> 2020-09-14T21:11:02.8200203Z [ERROR] 
> testReelectionOfJobMaster(org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest)
>   Time elapsed: 300.14 s  <<< FAILURE!
> 2020-09-14T21:11:02.8201761Z java.lang.AssertionError: Job failed.
> 2020-09-14T21:11:02.8202749Z  at 
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.throwAssertionErrorOnFailedResult(JobResultUtils.java:54)
> 2020-09-14T21:11:02.8203794Z  at 
> org.apache.flink.runtime.jobmaster.utils.JobResultUtils.assertSuccess(JobResultUtils.java:30)
> 2020-09-14T21:11:02.8205177Z  at 
> org.apache.flink.runtime.leaderelection.LeaderChangeClusterComponentsTest.testReelectionOfJobMaster(LeaderChangeClusterComponentsTest.java:152)
> 2020-09-14T21:11:02.8206191Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-09-14T21:11:02.8206985Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-09-14T21:11:02.8207930Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-09-14T21:11:02.8208927Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-09-14T21:11:02.8209753Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-09-14T21:11:02.8210710Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-09-14T21:11:02.8211608Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-09-14T21:11:02.8214473Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-09-14T21:11:02.8215398Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-14T21:11:02.8216199Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-09-14T21:11:02.8216947Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-09-14T21:11:02.8217695Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-09-14T21:11:02.8218635Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-09-14T21:11:02.8219499Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-09-14T21:11:02.8220313Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-09-14T21:11:02.8221060Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-09-14T21:11:02.8222171Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-09-14T21:11:02.8222937Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-09-14T21:11:02.8223688Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-09-14T21:11:02.8225191Z  at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 2020-09-14T21:11:02.8226086Z  at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> 2020-09-14T21:11:02.8226761Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-09-14T21:11:02.8227453Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-09-14T21:11:02.8228392Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-09-14T21:11:02.8229256Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-09-14T21:11:02.8235798Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-09-14T21:11:02.8237650Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-09-14T21:11:02.8239039Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-09-14T21:11:02.8239894Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-09-14T21:11:02.8240591Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-09-14T21:11:02.8241325Z Caused by: 
> org.apache.flink.runtime.JobException: Recovery is suppressed by 
> NoRestartBackoffTimeStrategy
> 2020-09-14T21:11:02.8242225Z  at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:116)
> 2020-09-14T21:11:02.8243358Z  at 
> org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:78)
> 2020-09-14T21:11:02.8244425Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:215)
> 2020-09-14T21:11:02.8245291Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:208)
> 2020-09-14T21:11:02.8246150Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:202)
> 2020-09-14T21:11:02.8247006Z  at 
> org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:523)
> 2020-09-14T21:11:02.8247960Z  at 
> org.apache.flink.runtime.scheduler.UpdateSchedulerNgOnInternalFailuresListener.notifyTaskFailure(UpdateSchedulerNgOnInternalFailuresListener.java:49)
> 2020-09-14T21:11:02.8249102Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionGraph.notifySchedulerNgAboutInternalTaskFailure(ExecutionGraph.java:1722)
> 2020-09-14T21:11:02.8249971Z  at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1283)
> 2020-09-14T21:11:02.8250675Z  at 
> org.apache.flink.runtime.executiongraph.Execution.processFail(Execution.java:1251)
> 2020-09-14T21:11:02.8251369Z  at 
> org.apache.flink.runtime.executiongraph.Execution.markFailed(Execution.java:1082)
> 2020-09-14T21:11:02.8252104Z  at 
> org.apache.flink.runtime.executiongraph.ExecutionVertex.markFailed(ExecutionVertex.java:748)
> 2020-09-14T21:11:02.8253060Z  at 
> org.apache.flink.runtime.scheduler.DefaultExecutionVertexOperations.markFailed(DefaultExecutionVertexOperations.java:41)
> 2020-09-14T21:11:02.8253956Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskDeploymentFailure(DefaultScheduler.java:458)
> 2020-09-14T21:11:02.8254967Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.lambda$assignResourceOrHandleError$6(DefaultScheduler.java:445)
> 2020-09-14T21:11:02.8393562Z  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-09-14T21:11:02.8394920Z  at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-09-14T21:11:02.8396122Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8397194Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8398150Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SchedulerImpl.lambda$internalAllocateSlot$0(SchedulerImpl.java:169)
> 2020-09-14T21:11:02.8399234Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-09-14T21:11:02.8400048Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-09-14T21:11:02.8401048Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8402025Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8403171Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$SingleTaskSlot.release(SlotSharingManager.java:731)
> 2020-09-14T21:11:02.8404708Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.release(SlotSharingManager.java:537)
> 2020-09-14T21:11:02.8405751Z  at 
> org.apache.flink.runtime.jobmaster.slotpool.SlotSharingManager$MultiTaskSlot.lambda$new$0(SlotSharingManager.java:432)
> 2020-09-14T21:11:02.8406633Z  at 
> java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:836)
> 2020-09-14T21:11:02.8407378Z  at 
> java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:811)
> 2020-09-14T21:11:02.8408120Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8408948Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8409748Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils.lambda$forwardTo$21(FutureUtils.java:1168)
> 2020-09-14T21:11:02.8410511Z  at 
> java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
> 2020-09-14T21:11:02.8411543Z  at 
> java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
> 2020-09-14T21:11:02.8412553Z  at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> 2020-09-14T21:11:02.8413340Z  at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
> 2020-09-14T21:11:02.8414204Z  at 
> org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1072)
> 2020-09-14T21:11:02.8415364Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:402)
> 2020-09-14T21:11:02.8416128Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:195)
> 2020-09-14T21:11:02.8417172Z  at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
> 2020-09-14T21:11:02.8417995Z  at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
> 2020-09-14T21:11:02.8418997Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> 2020-09-14T21:11:02.8419692Z  at 
> akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> 2020-09-14T21:11:02.8420336Z  at 
> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> 2020-09-14T21:11:02.8421055Z  at 
> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> 2020-09-14T21:11:02.8421655Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> 2020-09-14T21:11:02.8422336Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-09-14T21:11:02.8423049Z  at 
> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> 2020-09-14T21:11:02.8423681Z  at 
> akka.actor.Actor$class.aroundReceive(Actor.scala:517)
> 2020-09-14T21:11:02.8424505Z  at 
> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
> 2020-09-14T21:11:02.8425209Z  at 
> akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
> 2020-09-14T21:11:02.8425760Z  at 
> akka.actor.ActorCell.invoke(ActorCell.scala:561)
> 2020-09-14T21:11:02.8426376Z  at 
> akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
> 2020-09-14T21:11:02.8427252Z  at akka.dispatch.Mailbox.run(Mailbox.scala:225)
> 2020-09-14T21:11:02.8427931Z  at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
> 2020-09-14T21:11:02.8428684Z  at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 2020-09-14T21:11:02.8429375Z  at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 2020-09-14T21:11:02.8430118Z  at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 2020-09-14T21:11:02.8430853Z  at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2020-09-14T21:11:02.8431971Z Caused by: 
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not allocate the required slot within slot request timeout. Please make 
> sure that the cluster has enough resources.
> 2020-09-14T21:11:02.8433179Z  at 
> org.apache.flink.runtime.scheduler.DefaultScheduler.maybeWrapWithNoResourceAvailableException(DefaultScheduler.java:464)
> 2020-09-14T21:11:02.8434082Z  ... 45 more
> 2020-09-14T21:11:02.8434809Z Caused by: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.TimeoutException
> 2020-09-14T21:11:02.8435611Z  at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> 2020-09-14T21:11:02.8436379Z  at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> 2020-09-14T21:11:02.8437159Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
> 2020-09-14T21:11:02.8437976Z  at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> 2020-09-14T21:11:02.8438658Z  ... 25 more
> 2020-09-14T21:11:02.8439085Z Caused by: java.util.concurrent.TimeoutException
> 2020-09-14T21:11:02.8439476Z  ... 23 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to