[
https://issues.apache.org/jira/browse/FLINK-18137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17132971#comment-17132971
]
Robert Metzger commented on FLINK-18137:
----------------------------------------
The error is somewhat reproducible locally, BUT not when running the test until
failure.
Only when starting it from scratch.
It seems that the test is passing when the savepoint is triggered right after
the vertex switches to RUNNING (my logs distinguish between SAVEPOINT and
CHECKPOINT):
{code}
3431 [testVertex (1/1)] INFO org.apache.flink.runtime.taskmanager.Task [] -
testVertex (1/1) (40dd161a086ef633b70d9f505ba614c6) switched from DEPLOYING to
RUNNING.
3432 [flink-akka.actor.default-dispatcher-5] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - testVertex (1/1)
(40dd161a086ef633b70d9f505ba614c6) switched from DEPLOYING to RUNNING.
3496 [Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
SAVEPOINT 2 @ 1591798767033 for job f51485b0d91e3ff6be1d7a55d78c57d7.
3499 [flink-akka.actor.default-dispatcher-3] DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Trigger checkpoint
2@1591798767033 for 40dd161a086ef633b70d9f505ba614c6.
3504 [jobmanager-future-thread-3] DEBUG
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received
acknowledge message for checkpoint 2 from task 40dd161a086ef633b70d9f505ba614c6
of job f51485b0d91e3ff6be1d7a55d78c57d7 at 3ddadd8d-c5f3-495e-9c9c-b9278100c72a
@ localhost (dataPort=-1).
3522 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Savepoint stored in
file:/var/folders/js/yfk_y2450q7559kygttykwk00000gn/T/junit3940216507898676469/junit3659258369742740599/savepoint-f51485-2066a260a0e0.
Now cancelling f51485b0d91e3ff6be1d7a55d78c57d7.
3523 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job (unnamed job)
(f51485b0d91e3ff6be1d7a55d78c57d7) switched from state RUNNING to CANCELLING.
{code}
However, if the Checkpoint Timer manages to trigger a checkpoint in between,
then the test is failing.
{code}
3585 [Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
CHECKPOINT 1 @ 1591797060672 for job c5388d02ac93d92790e7a43b0b9f8c08.
3589 [flink-akka.actor.default-dispatcher-4] DEBUG
org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Abort checkpoint
1@1591797060672 for 4174621aed015dd50b4ec8a204ba8724.
{code}
What I believe is happening:
- C1: Checkpoint 1 gets triggered, sets CheckpointCoordinator.isTriggering flag
to "true"
- S1: triggerSavepoint call comes in
- S1: SchedulerBase.triggerSavepoint(SchedulerBase.java:751) calls
checkpointCoordinator.stopCheckpointScheduler() and cancels all pending
checkpoints, in this case Checkpoint 1.
- S1: SchedulerBase.triggerSavepoint(SchedulerBase.java:751) now is triggering
a checkpoint at the Coordinator, however, the CheckpointRequestDecider decides
to not execute any CheckpointTriggerRequest, because isTriggering = true.
- the system never returns anything on the SchedulerBase.triggerSavepoint call
(because the case that the decider opts for "nothing" is not handled here:
https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L501)
Solutions:
a) set isTriggering to "false" when aborting pending checkpoints
b) only modify isTriggering under the CheckpointCoordinator.lock (the
checkpoint triggering is not happening under the lock)
I would appreciate if somebody who knows this component better could validate
my analysis and suggest how to resolve the problem.
> JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint fails with
> AskTimeoutException
> ----------------------------------------------------------------------------------------
>
> Key: FLINK-18137
> URL: https://issues.apache.org/jira/browse/FLINK-18137
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Runtime / Task
> Affects Versions: 1.11.0, 1.12.0
> Reporter: Robert Metzger
> Assignee: Robert Metzger
> Priority: Critical
> Labels: test-stability
> Fix For: 1.11.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=2747&view=logs&j=5c8e7682-d68f-54d1-16a2-a09310218a49&t=45cc9205-bdb7-5b54-63cd-89fdc0983323
> {code}
> 2020-06-04T16:17:20.4404189Z [ERROR] Tests run: 4, Failures: 0, Errors: 1,
> Skipped: 0, Time elapsed: 14.352 s <<< FAILURE! - in
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase
> 2020-06-04T16:17:20.4405548Z [ERROR]
> testStopJobAfterSavepoint(org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase)
> Time elapsed: 10.058 s <<< ERROR!
> 2020-06-04T16:17:20.4407342Z java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: Invocation of public default
> java.util.concurrent.CompletableFuture
> org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time)
> timed out.
> 2020-06-04T16:17:20.4409562Z at
> java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-06-04T16:17:20.4410333Z at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2020-06-04T16:17:20.4411259Z at
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:264)
> 2020-06-04T16:17:20.4412292Z at
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.testStopJobAfterSavepoint(JobMasterTriggerSavepointITCase.java:127)
> 2020-06-04T16:17:20.4413163Z at
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-06-04T16:17:20.4413990Z at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-06-04T16:17:20.4414783Z at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-06-04T16:17:20.4415936Z at
> java.lang.reflect.Method.invoke(Method.java:498)
> 2020-06-04T16:17:20.4416693Z at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-06-04T16:17:20.4417632Z at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-06-04T16:17:20.4418637Z at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-06-04T16:17:20.4419367Z at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-06-04T16:17:20.4420118Z at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4420742Z at
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> 2020-06-04T16:17:20.4421909Z at
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-06-04T16:17:20.4422493Z at
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-06-04T16:17:20.4423247Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-06-04T16:17:20.4424263Z at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-06-04T16:17:20.4424876Z at
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-06-04T16:17:20.4426346Z at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-06-04T16:17:20.4427052Z at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-06-04T16:17:20.4427772Z at
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-06-04T16:17:20.4428562Z at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-06-04T16:17:20.4429158Z at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4429861Z at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> 2020-06-04T16:17:20.4430448Z at
> org.junit.rules.RunRules.evaluate(RunRules.java:20)
> 2020-06-04T16:17:20.4431060Z at
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-06-04T16:17:20.4431678Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-06-04T16:17:20.4432513Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-06-04T16:17:20.4433396Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-06-04T16:17:20.4434298Z at
> org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-06-04T16:17:20.4440904Z at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-06-04T16:17:20.4443425Z at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-06-04T16:17:20.4444349Z at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-06-04T16:17:20.4445160Z at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-06-04T16:17:20.4446389Z Caused by:
> java.util.concurrent.TimeoutException: Invocation of public default
> java.util.concurrent.CompletableFuture
> org.apache.flink.runtime.webmonitor.RestfulGateway.triggerSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time)
> timed out.
> 2020-06-04T16:17:20.4447610Z at
> com.sun.proxy.$Proxy31.triggerSavepoint(Unknown Source)
> 2020-06-04T16:17:20.4448545Z at
> org.apache.flink.runtime.minicluster.MiniCluster.lambda$triggerSavepoint$8(MiniCluster.java:595)
> 2020-06-04T16:17:20.4449259Z at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> 2020-06-04T16:17:20.4449990Z at
> java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
> 2020-06-04T16:17:20.4450789Z at
> java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
> 2020-06-04T16:17:20.4451584Z at
> org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:621)
> 2020-06-04T16:17:20.4452473Z at
> org.apache.flink.runtime.minicluster.MiniCluster.triggerSavepoint(MiniCluster.java:595)
> 2020-06-04T16:17:20.4453572Z at
> org.apache.flink.client.program.MiniClusterClient.cancelWithSavepoint(MiniClusterClient.java:89)
> 2020-06-04T16:17:20.4454746Z at
> org.apache.flink.runtime.jobmaster.JobMasterTriggerSavepointITCase.cancelWithSavepoint(JobMasterTriggerSavepointITCase.java:262)
> 2020-06-04T16:17:20.4455517Z ... 32 more
> 2020-06-04T16:17:20.4457589Z Caused by: akka.pattern.AskTimeoutException: Ask
> timed out on [Actor[akka://flink/user/rpc/dispatcher_2#830345697]] after
> [10000 ms]. Message of type
> [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason
> for `AskTimeoutException` is that the recipient actor didn't send a reply.
> 2020-06-04T16:17:20.4459164Z at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-06-04T16:17:20.4460107Z at
> akka.pattern.PromiseActorRef$$anonfun$2.apply(AskSupport.scala:635)
> 2020-06-04T16:17:20.4460819Z at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:648)
> 2020-06-04T16:17:20.4461613Z at
> akka.actor.Scheduler$$anon$4.run(Scheduler.scala:205)
> 2020-06-04T16:17:20.4462444Z at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> 2020-06-04T16:17:20.4463203Z at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> 2020-06-04T16:17:20.4464089Z at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> 2020-06-04T16:17:20.4464833Z at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:328)
> 2020-06-04T16:17:20.4465800Z at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:279)
> 2020-06-04T16:17:20.4466746Z at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:283)
> 2020-06-04T16:17:20.4467579Z at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:235)
> 2020-06-04T16:17:20.4468467Z at java.lang.Thread.run(Thread.java:748)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)