[ 
https://issues.apache.org/jira/browse/FLINK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133337#comment-17133337
 ] 

Biao Liu commented on FLINK-18137:
----------------------------------

I just saw this issue. I think [~trohrmann] is right.
There is a problem with the if/else in 
[https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L547].
 The {{throwable}} passed to {{onTriggerFailure}} can unexpectedly be null. 
That's my fault: I wrote this code and noticed the problem some 
days ago. I planned to fix it in a later PR because, when I checked at that 
time, it could not raise an NPE, so I did not consider it urgent. However, 
FLINK-16770 disrupted that plan: I reverted a lot of code and forgot to fix this 
potential issue separately. Unfortunately, 
https://github.com/apache/flink/commit/1af33f1285d557f0171f4587d7f4e789df27e7cb 
hits this NPE. By design, {{onTriggerFailure}} should not throw any exception.
The code has changed a bit since my last commit. I need to double-check the comment 
mentioned by [~roman_khachatryan] to make sure there are no other issues.
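
To illustrate the kind of if/else problem described above, here is a minimal, self-contained sketch. It is not the actual {{CheckpointCoordinator}} code: the class {{NullableThrowableSketch}}, the {{Pending}} type, and the one-argument {{onTriggerFailure}} are hypothetical stand-ins. It assumes the success condition combines several checks, so the else branch can be reached with a null {{throwable}}, and it shows one possible way to keep the failure callback from throwing.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

public class NullableThrowableSketch {

    /** Hypothetical stand-in for a pending checkpoint; not a Flink class. */
    static final class Pending {
        final boolean discarded;

        Pending(boolean discarded) {
            this.discarded = discarded;
        }
    }

    /** Records the message of the last failure so the outcome can be inspected. */
    static final AtomicReference<String> lastFailure = new AtomicReference<>();

    /** Failure callback that does not throw, even when no cause is supplied. */
    static void onTriggerFailure(Throwable throwable) {
        // Assumption: substituting a placeholder cause is acceptable here.
        Throwable cause = throwable != null
                ? throwable
                : new IllegalStateException("trigger failed without a cause");
        lastFailure.set(cause.getMessage());
    }

    static void complete(CompletableFuture<Pending> future) {
        future.whenComplete((pending, throwable) -> {
            if (throwable == null && pending != null && !pending.discarded) {
                lastFailure.set(null); // success path
            } else {
                // throwable is null here if the future succeeded but the
                // result was already discarded, so the callee must tolerate null.
                onTriggerFailure(throwable);
            }
        });
    }

    public static void main(String[] args) {
        // The future completes normally, yet the else branch is taken.
        complete(CompletableFuture.completedFuture(new Pending(true)));
        System.out.println(lastFailure.get());
    }
}
```

Without the null check in this sketch's {{onTriggerFailure}}, the call to {{cause.getMessage()}} would raise an NPE inside the completion callback, and callers waiting on the result might never get a reply.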


> JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint fails with 
> AskTimeoutException
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-18137
>                 URL: https://issues.apache.org/jira/browse/FLINK-18137
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination, Runtime 
> / Task, Tests
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Robert Metzger
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: pull-request-available, test-stability
>             Fix For: 1.11.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2747&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=45cc9205-bdb7-5b54-63cd-89fdc0983323
> {code}
> 2020-06-04T16:17:20.4404189Z [ERROR] Tests run: 4, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 14.352 s <<< FAILURE! - in 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase
> 2020-06-04T16:17:20.4405548Z [ERROR] 
> testStopJobAfterSavepoint(org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase)
>   Time elapsed: 10.058 s  <<< ERROR!
> 2020-06-04T16:17:20.4407342Z java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException: Invocation of public default 
> java.util.concurrent.CompletableFuture 
> org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time)
>  timed out.
> 2020-06-04T16:17:20.4409562Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-06-04T16:17:20.4410333Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2020-06-04T16:17:20.4411259Z  at 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:264)
> 2020-06-04T16:17:20.4412292Z  at 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint(JobMasterTriggerSavepointITCase.java:127)
> 2020-06-04T16:17:20.4413163Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-06-04T16:17:20.4413990Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-06-04T16:17:20.4414783Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-06-04T16:17:20.4415936Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-06-04T16:17:20.4416693Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-06-04T16:17:20.4417632Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-06-04T16:17:20.4418637Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-06-04T16:17:20.4419367Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-06-04T16:17:20.4420118Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4420742Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-06-04T16:17:20.4421909Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-06-04T16:17:20.4422493Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-06-04T16:17:20.4423247Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-06-04T16:17:20.4424263Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-06-04T16:17:20.4424876Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-06-04T16:17:20.4426346Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-06-04T16:17:20.4427052Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-06-04T16:17:20.4427772Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-06-04T16:17:20.4428562Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-06-04T16:17:20.4429158Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4429861Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4430448Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-06-04T16:17:20.4431060Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-06-04T16:17:20.4431678Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-06-04T16:17:20.4432513Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-06-04T16:17:20.4433396Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-06-04T16:17:20.4434298Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-06-04T16:17:20.4440904Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-06-04T16:17:20.4443425Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-06-04T16:17:20.4444349Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-06-04T16:17:20.4445160Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-06-04T16:17:20.4446389Z Caused by: 
> java.util.concurrent.TimeoutException: Invocation of public default 
> java.util.concurrent.CompletableFuture 
> org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time)
>  timed out.
> 2020-06-04T16:17:20.4447610Z  at 
> com.sun.proxy.$Proxy31.triggerSavepoint(Unknown Source)
> 2020-06-04T16:17:20.4448545Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.lambda$triggerSavepoint$8(MiniCluster.java:595)
> 2020-06-04T16:17:20.4449259Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2020-06-04T16:17:20.4449990Z  at 
> java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
> 2020-06-04T16:17:20.4450789Z  at 
> java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
> 2020-06-04T16:17:20.4451584Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:621)
> 2020-06-04T16:17:20.4452473Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.triggerSavepoint(MiniCluster.java:595)
> 2020-06-04T16:17:20.4453572Z  at 
> org.apache.flink.client.program.MiniClusterClient.cancelWithSavepoint(MiniClusterClient.java:89)
> 2020-06-04T16:17:20.4454746Z  at 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:262)
> 2020-06-04T16:17:20.4455517Z  ... 32 more
> 2020-06-04T16:17:20.4457589Z Caused by: akka.pattern.AskTimeoutException: Ask 
> timed out on [Actor[akka://flink/user/rpc/dispatcher_2#830345697]] after 
> [10000 ms]. Message of type 
> [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason 
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
> 2020-06-04T16:17:20.4459164Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-06-04T16:17:20.4460107Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-06-04T16:17:20.4460819Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-06-04T16:17:20.4461613Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-06-04T16:17:20.4462444Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-06-04T16:17:20.4463203Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-06-04T16:17:20.4464089Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-06-04T16:17:20.4464833Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-06-04T16:17:20.4465800Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-06-04T16:17:20.4466746Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-06-04T16:17:20.4467579Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-06-04T16:17:20.4468467Z  at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
