[
https://issues.apache.org/jira/browse/FLINK-22568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17402295#comment-17402295
]
Till Rohrmann commented on FLINK-22568:
---------------------------------------
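I think the problem is our CI infrastructure: there is a gap of about 13s
between receiving the savepoint command (12:59:25,726) and actually
triggering the savepoint (12:59:39,015), which is longer than the 10000 ms
ask timeout reported in the exception: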
{code}
12:59:25,726 [flink-akka.actor.default-dispatcher-5] INFO org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering savepoint for job 2e88260d7fd42515ac6b8181788b2583.
12:59:27,579 [jobmanager-future-thread-6] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 1 for job 2e88260d7fd42515ac6b8181788b2583 (1362646 bytes, checkpointDuration=1623 ms, finalizationTime=291 ms).
12:59:39,015 [ Checkpoint Timer] INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 2 (type=SAVEPOINT) @ 1628773179014 for job 2e88260d7fd42515ac6b8181788b2583.
12:59:39,019 [ main] ERROR org.apache.flink.test.checkpointing.RescalingITCase [] -
--------------------------------------------------------------------------------
Test testSavepointRescalingWithKeyedAndNonPartitionedState[backend = filesystem, buffersPerChannel = 0](org.apache.flink.test.checkpointing.RescalingITCase) failed with:
java.util.concurrent.ExecutionException: java.util.concurrent.TimeoutException: Invocation of [LocalRpcInvocation(RestfulGateway.triggerSavepoint(JobID, String, boolean, Time))] at recipient [akka://flink/user/rpc/dispatcher_200] timed out.
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
    at org.apache.flink.test.checkpointing.RescalingITCase.testSavepointRescalingWithKeyedAndNonPartitionedState(RescalingITCase.java:425)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
    at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runners.Suite.runChild(Suite.java:128)
    at org.junit.runners.Suite.runChild(Suite.java:27)
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
    at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
    at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:115)
    at org.junit.vintage.engine.execution.RunnerExecutor.execute(RunnerExecutor.java:43)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Iterator.forEachRemaining(Iterator.java:116)
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
    at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
    at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:485)
    at org.junit.vintage.engine.VintageTestEngine.executeAllChildren(VintageTestEngine.java:82)
    at org.junit.vintage.engine.VintageTestEngine.execute(VintageTestEngine.java:73)
    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:220)
    at org.junit.platform.launcher.core.DefaultLauncher.lambda$execute$6(DefaultLauncher.java:188)
    at org.junit.platform.launcher.core.DefaultLauncher.withInterceptedStreams(DefaultLauncher.java:202)
    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:181)
    at org.junit.platform.launcher.core.DefaultLauncher.execute(DefaultLauncher.java:128)
    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invokeAllTests(JUnitPlatformProvider.java:150)
    at org.apache.maven.surefire.junitplatform.JUnitPlatformProvider.invoke(JUnitPlatformProvider.java:120)
    at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
    at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
    at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
Caused by: java.util.concurrent.TimeoutException: Invocation of [LocalRpcInvocation(RestfulGateway.triggerSavepoint(JobID, String, boolean, Time))] at recipient [akka://flink/user/rpc/dispatcher_200] timed out.
    at com.sun.proxy.$Proxy369.triggerSavepoint(Unknown Source)
    at org.apache.flink.runtime.minicluster.MiniCluster.lambda$triggerSavepoint$9(MiniCluster.java:741)
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
    at java.util.concurrent.CompletableFuture.uniApplyStage(CompletableFuture.java:628)
    at java.util.concurrent.CompletableFuture.thenApply(CompletableFuture.java:1996)
    at org.apache.flink.runtime.minicluster.MiniCluster.runDispatcherCommand(MiniCluster.java:776)
    at org.apache.flink.runtime.minicluster.MiniCluster.triggerSavepoint(MiniCluster.java:739)
    at org.apache.flink.client.program.MiniClusterClient.triggerSavepoint(MiniClusterClient.java:101)
    at org.apache.flink.test.checkpointing.RescalingITCase.testSavepointRescalingWithKeyedAndNonPartitionedState(RescalingITCase.java:422)
    ... 60 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/rpc/dispatcher_200#1435682858]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.LocalFencedMessage]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.
{code}
I think this problem should be resolved, or at least mitigated, by FLINK-22932.
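For reference, the size of the gap can be checked directly from the log
timestamps. The following is only a rough sketch: the class name
SavepointGap and the helper gapMillis are made up for illustration and are
not part of Flink, and it assumes both timestamps fall on the same day. It
only compares the gap with the 10000 ms figure from the AskTimeoutException;
it makes no claim about when the ask timer actually started.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not part of Flink.
public class SavepointGap {

    // Timestamps in the log excerpt look like "12:59:25,726".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    // Milliseconds between two log timestamps (assumes the same day).
    static long gapMillis(String from, String to) {
        LocalTime start = LocalTime.parse(from, LOG_TIME);
        LocalTime end = LocalTime.parse(to, LOG_TIME);
        return Duration.between(start, end).toMillis();
    }

    public static void main(String[] args) {
        // "after [10000 ms]" in the AskTimeoutException message.
        long askTimeoutMillis = 10_000L;

        // JobMaster "Triggering savepoint" vs. CheckpointCoordinator
        // "Triggering checkpoint 2 (type=SAVEPOINT)".
        long gap = gapMillis("12:59:25,726", "12:59:39,015");

        System.out.println("gap = " + gap + " ms, exceeds ask timeout: "
                + (gap > askTimeoutMillis));
    }
}
```

With the two timestamps above this reports a gap of 13289 ms, which is
larger than the 10000 ms timeout.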
> RescalingITCase.testSavepointRescalingInPartitionedOperatorStateList fails
> with Timeout
> ---------------------------------------------------------------------------------------
>
> Key: FLINK-22568
> URL: https://issues.apache.org/jira/browse/FLINK-22568
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0
> Reporter: Matthias
> Priority: Major
> Labels: test-stability
> Fix For: 1.14.0
>
>
> [This
> build|https://dev.azure.com/mapohl/flink/_build/results?buildId=409&view=logs&j=0a15d512-44ac-5ba5-97ab-13a5d066c22c&t=634cd701-c189-5dff-24cb-606ed884db87]
> failed (not exclusively) due to:
> * [testSavepointRescalingInPartitionedOperatorStateList[backend =
> filesystem](org.apache.flink.test.checkpointing.RescalingITCase)|https://dev.azure.com/mapohl/flink/_build/results?buildId=409&view=logs&j=0a15d512-44ac-5ba5-97ab-13a5d066c22c&t=634cd701-c189-5dff-24cb-606ed884db87&l=4193]