[ 
https://issues.apache.org/jira/browse/FLINK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132971#comment-17132971
 ] 

Robert Metzger commented on FLINK-18137:
----------------------------------------

The error is somewhat reproducible locally, BUT not when running the test until 
failure.
Only when starting it from scratch.

It seems that the test is passing when the savepoint is triggered right after 
the vertex switches to RUNNING (my logs distinguish between SAVEPOINT and 
CHECKPOINT): 
{code}
3431 [testVertex (1/1)] INFO  org.apache.flink.runtime.taskmanager.Task [] - 
testVertex (1/1) (40dd161a086ef633b70d9f505ba614c6) switched from DEPLOYING to 
RUNNING.
3432 [flink-akka.actor.default-dispatcher-5] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - testVertex (1/1) 
(40dd161a086ef633b70d9f505ba614c6) switched from DEPLOYING to RUNNING.
3496 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering 
SAVEPOINT 2 @ 1591798767033 for job f51485b0d91e3ff6be1d7a55d78c57d7.
3499 [flink-akka.actor.default-dispatcher-3] DEBUG 
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Trigger checkpoint 
2@1591798767033 for 40dd161a086ef633b70d9f505ba614c6.
3504 [jobmanager-future-thread-3] DEBUG 
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received 
acknowledge message for checkpoint 2 from task 40dd161a086ef633b70d9f505ba614c6 
of job f51485b0d91e3ff6be1d7a55d78c57d7 at 3ddadd8d-c5f3-495e-9c9c-b9278100c72a 
@ localhost (dataPort=-1).
3522 [flink-akka.actor.default-dispatcher-2] INFO  
org.apache.flink.runtime.jobmaster.JobMaster [] - Savepoint stored in 
file:/var/folders/js/yfk_y2450q7559kygttykwk00000gn/T/junit3940216507898676469/junit3659258369742740599/savepoint-f51485-2066a260a0e0.
 Now cancelling f51485b0d91e3ff6be1d7a55d78c57d7.
3523 [flink-akka.actor.default-dispatcher-2] INFO  
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job (unnamed job) 
(f51485b0d91e3ff6be1d7a55d78c57d7) switched from state RUNNING to CANCELLING.
{code}

However, if the Checkpoint Timer manages to trigger a checkpoint in between, 
then the test is failing.
{code}
3585 [Checkpoint Timer] INFO  
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering 
CHECKPOINT 1 @ 1591797060672 for job c5388d02ac93d92790e7a43b0b9f8c08.
3589 [flink-akka.actor.default-dispatcher-4] DEBUG 
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Abort checkpoint 
1@1591797060672 for 4174621aed015dd50b4ec8a204ba8724.
{code}

What I believe is happening:
- C1: Checkpoint 1 gets triggered, sets CheckpointCoordinator.isTriggering flag 
to "true"
- S1: triggerSavepoint call comes in
- S1: SchedulerBase.triggerSavepoint(SchedulerBase.java:751) calls 
checkpointCoordinator.stopCheckpointScheduler() and cancels all pending 
checkpoints, in this case Checkpoint 1. 
- S1: SchedulerBase.triggerSavepoint(SchedulerBase.java:751) now is triggering 
a checkpoint at the Coordinator, however, the CheckpointRequestDecider decides 
to not execute any CheckpointTriggerRequest, because isTriggering = true.
- the system never returns anything on the SchedulerBase.triggerSavepoint call 
(because the case that the decider opts for "nothing" is not handled here: 
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L501)
 

Solutions:
a) set isTriggering to "false" when aborting pending checkpoints
b) only modify isTriggering under the CheckpointCoordinator.lock (the 
checkpoint triggering is not happening under the lock)

I would appreciate if somebody who knows this component better could validate 
my analysis and suggest how to resolve the problem.


> JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint fails with 
> AskTimeoutException
> ----------------------------------------------------------------------------------------
>
>                 Key: FLINK-18137
>                 URL: https://issues.apache.org/jira/browse/FLINK-18137
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Runtime / Task
>    Affects Versions: 1.11.0, 1.12.0
>            Reporter: Robert Metzger
>            Assignee: Robert Metzger
>            Priority: Critical
>              Labels: test-stability
>             Fix For: 1.11.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2747&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=45cc9205-bdb7-5b54-63cd-89fdc0983323
> {code}
> 2020-06-04T16:17:20.4404189Z [ERROR] Tests run: 4, Failures: 0, Errors: 1, 
> Skipped: 0, Time elapsed: 14.352 s <<< FAILURE! - in 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase
> 2020-06-04T16:17:20.4405548Z [ERROR] 
> testStopJobAfterSavepoint(org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase)
>   Time elapsed: 10.058 s  <<< ERROR!
> 2020-06-04T16:17:20.4407342Z java.util.concurrent.ExecutionException: 
> java.util.concurrent.TimeoutException: Invocation of public default 
> java.util.concurrent.CompletableFuture 
> org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time)
>  timed out.
> 2020-06-04T16:17:20.4409562Z  at 
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-06-04T16:17:20.4410333Z  at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2020-06-04T16:17:20.4411259Z  at 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:264)
> 2020-06-04T16:17:20.4412292Z  at 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint(JobMasterTriggerSavepointITCase.java:127)
> 2020-06-04T16:17:20.4413163Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-06-04T16:17:20.4413990Z  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-06-04T16:17:20.4414783Z  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-06-04T16:17:20.4415936Z  at 
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-06-04T16:17:20.4416693Z  at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-06-04T16:17:20.4417632Z  at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-06-04T16:17:20.4418637Z  at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-06-04T16:17:20.4419367Z  at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-06-04T16:17:20.4420118Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4420742Z  at 
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-06-04T16:17:20.4421909Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-06-04T16:17:20.4422493Z  at 
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-06-04T16:17:20.4423247Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-06-04T16:17:20.4424263Z  at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-06-04T16:17:20.4424876Z  at 
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-06-04T16:17:20.4426346Z  at 
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-06-04T16:17:20.4427052Z  at 
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-06-04T16:17:20.4427772Z  at 
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-06-04T16:17:20.4428562Z  at 
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-06-04T16:17:20.4429158Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4429861Z  at 
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4430448Z  at 
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-06-04T16:17:20.4431060Z  at 
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-06-04T16:17:20.4431678Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-06-04T16:17:20.4432513Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-06-04T16:17:20.4433396Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-06-04T16:17:20.4434298Z  at 
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-06-04T16:17:20.4440904Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-06-04T16:17:20.4443425Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-06-04T16:17:20.4444349Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-06-04T16:17:20.4445160Z  at 
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-06-04T16:17:20.4446389Z Caused by: 
> java.util.concurrent.TimeoutException: Invocation of public default 
> java.util.concurrent.CompletableFuture 
> org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time)
>  timed out.
> 2020-06-04T16:17:20.4447610Z  at 
> com.sun.proxy.$Proxy31.triggerSavepoint(Unknown Source)
> 2020-06-04T16:17:20.4448545Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.lambda$triggerSavepoint$8(MiniCluster.java:595)
> 2020-06-04T16:17:20.4449259Z  at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2020-06-04T16:17:20.4449990Z  at 
> java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
> 2020-06-04T16:17:20.4450789Z  at 
> java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
> 2020-06-04T16:17:20.4451584Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:621)
> 2020-06-04T16:17:20.4452473Z  at 
> org.apache.flink.runtime.minicluster.MiniCluster.triggerSavepoint(MiniCluster.java:595)
> 2020-06-04T16:17:20.4453572Z  at 
> org.apache.flink.client.program.MiniClusterClient.cancelWithSavepoint(MiniClusterClient.java:89)
> 2020-06-04T16:17:20.4454746Z  at 
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:262)
> 2020-06-04T16:17:20.4455517Z  ... 32 more
> 2020-06-04T16:17:20.4457589Z Caused by: akka.pattern.AskTimeoutException: Ask 
> timed out on [Actor[akka://flink/user/rpc/dispatcher_2#830345697]] after 
> [10000 ms]. Message of type 
> [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason 
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
> 2020-06-04T16:17:20.4459164Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-06-04T16:17:20.4460107Z  at 
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-06-04T16:17:20.4460819Z  at 
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-06-04T16:17:20.4461613Z  at 
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-06-04T16:17:20.4462444Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-06-04T16:17:20.4463203Z  at 
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-06-04T16:17:20.4464089Z  at 
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-06-04T16:17:20.4464833Z  at 
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-06-04T16:17:20.4465800Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-06-04T16:17:20.4466746Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-06-04T16:17:20.4467579Z  at 
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-06-04T16:17:20.4468467Z  at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to