[ https://issues.apache.org/jira/browse/FLINK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133337#comment-17133337 ]
Biao Liu commented on FLINK-18137: ---------------------------------- I just saw this issue. I think [~trohrmann] is right. There is a problem of if/else in [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547]. The {{throwable}} passed to {{onTriggerFailure}} can be null unexpectedly. Actually that's my fault, this code is written by me and I realized it some days ago. I was planning to fix it in the later PR because I checked it at that time that it can't raise NPE, so I thought it's not emergency. However FLINK-16770 breaks the plan, I reverted a lot of codes and forgot to fix this potential issue separately. Unfortunately https://github.com/apache/flink/commit/1af33f1285d557f0171f4587d7f4e789df27e7cb hits this NPE. {{onTriggerFailure}} shouldn't throw any exception by design. The codes changed a bit from my last commit. I need to double check the comment mentioned by [~roman_khachatryan] to make sure there is no other issue. > JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint fails with > AskTimeoutException > ---------------------------------------------------------------------------------------- > > Key: FLINK-18137 > URL: https://issues.apache.org/jira/browse/FLINK-18137 > Project: Flink > Issue Type: Bug > Components: Runtime / Checkpointing, Runtime / Coordination, Runtime > / Task, Tests > Affects Versions: 1.11.0, 1.12.0 > Reporter: Robert Metzger > Assignee: Till Rohrmann > Priority: Blocker > Labels: pull-request-available, test-stability > Fix For: 1.11.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2747&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=45cc9205-bdb7-5b54-63cd-89fdc0983323 > {code} > 2020-06-04T16:17:20.4404189Z [ERROR] Tests run: 4, Failures: 0, Errors: 1, > Skipped: 0, Time elapsed: 14.352 s <<< FAILURE! - in > org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase > 2020-06-04T16:17:20.4405548Z [ERROR] > testStopJobAfterSavepoint(org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase) > Time elapsed: 10.058 s <<< ERROR! > 2020-06-04T16:17:20.4407342Z java.util.concurrent.ExecutionException: > java.util.concurrent.TimeoutException: Invocation of public default > java.util.concurrent.CompletableFuture > org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time) > timed out. > 2020-06-04T16:17:20.4409562Z at > java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) > 2020-06-04T16:17:20.4410333Z at > java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908) > 2020-06-04T16:17:20.4411259Z at > org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:264) > 2020-06-04T16:17:20.4412292Z at > org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint(JobMasterTriggerSavepointITCase.java:127) > 2020-06-04T16:17:20.4413163Z at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 2020-06-04T16:17:20.4413990Z at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 2020-06-04T16:17:20.4414783Z at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 2020-06-04T16:17:20.4415936Z at > java.lang.reflect.Method.invoke(Method.java:498) > 2020-06-04T16:17:20.4416693Z at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > 2020-06-04T16:17:20.4417632Z at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > 2020-06-04T16:17:20.4418637Z at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > 2020-06-04T16:17:20.4419367Z at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > 2020-06-04T16:17:20.4420118Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-06-04T16:17:20.4420742Z at > org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55) > 2020-06-04T16:17:20.4421909Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-06-04T16:17:20.4422493Z at > org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > 2020-06-04T16:17:20.4423247Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > 2020-06-04T16:17:20.4424263Z at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > 2020-06-04T16:17:20.4424876Z at > org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > 2020-06-04T16:17:20.4426346Z at > org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > 2020-06-04T16:17:20.4427052Z at > org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > 2020-06-04T16:17:20.4427772Z at > org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > 2020-06-04T16:17:20.4428562Z at > org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > 2020-06-04T16:17:20.4429158Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-06-04T16:17:20.4429861Z at > org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48) > 2020-06-04T16:17:20.4430448Z at > org.junit.rules.RunRules.evaluate(RunRules.java:20) > 2020-06-04T16:17:20.4431060Z at > org.junit.runners.ParentRunner.run(ParentRunner.java:363) > 2020-06-04T16:17:20.4431678Z at > org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365) > 2020-06-04T16:17:20.4432513Z at > org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273) > 2020-06-04T16:17:20.4433396Z at > org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238) > 2020-06-04T16:17:20.4434298Z at > org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159) > 2020-06-04T16:17:20.4440904Z at > org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384) > 2020-06-04T16:17:20.4443425Z at > org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345) > 2020-06-04T16:17:20.4444349Z at > org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126) > 2020-06-04T16:17:20.4445160Z at > org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418) > 2020-06-04T16:17:20.4446389Z Caused by: > java.util.concurrent.TimeoutException: Invocation of public default > java.util.concurrent.CompletableFuture > org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time) > timed out. > 2020-06-04T16:17:20.4447610Z at > com.sun.proxy.$Proxy31.triggerSavepoint(Unknown Source) > 2020-06-04T16:17:20.4448545Z at > org.apache.flink.runtime.minicluster.MiniCluster.lambda$triggerSavepoint$8(MiniCluster.java:595) > 2020-06-04T16:17:20.4449259Z at > java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616) > 2020-06-04T16:17:20.4449990Z at > java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628) > 2020-06-04T16:17:20.4450789Z at > java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996) > 2020-06-04T16:17:20.4451584Z at > org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:621) > 2020-06-04T16:17:20.4452473Z at > org.apache.flink.runtime.minicluster.MiniCluster.triggerSavepoint(MiniCluster.java:595) > 2020-06-04T16:17:20.4453572Z at > org.apache.flink.client.program.MiniClusterClient.cancelWithSavepoint(MiniClusterClient.java:89) > 2020-06-04T16:17:20.4454746Z at > org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:262) > 2020-06-04T16:17:20.4455517Z ... 32 more > 2020-06-04T16:17:20.4457589Z Caused by: akka.pattern.AskTimeoutException: Ask > timed out on [Actor[akka://flink/user/rpc/dispatcher_2#830345697]] after > [10000 ms]. Message of type > [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason > for `AskTimeoutException` is that the recipient actor didn't send a reply. > 2020-06-04T16:17:20.4459164Z at > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) > 2020-06-04T16:17:20.4460107Z at > akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635) > 2020-06-04T16:17:20.4460819Z at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648) > 2020-06-04T16:17:20.4461613Z at > akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205) > 2020-06-04T16:17:20.4462444Z at > scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601) > 2020-06-04T16:17:20.4463203Z at > scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109) > 2020-06-04T16:17:20.4464089Z at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599) > 2020-06-04T16:17:20.4464833Z at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328) > 2020-06-04T16:17:20.4465800Z at > akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279) > 2020-06-04T16:17:20.4466746Z at > akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283) > 2020-06-04T16:17:20.4467579Z at > akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235) > 2020-06-04T16:17:20.4468467Z at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)