[
https://issues.apache.org/jira/browse/FLINK-22367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325664#comment-17325664
]
Chesnay Schepler commented on FLINK-22367:
------------------------------------------
hmm... it seems like a task isn't shutting down after a stopWithSavepoint
operation. The savepoint operation itself seems to work fine, but only 1
subtask shuts down afterwards. After 10 seconds the test is timing out:
{code:java}
22:28:32,329 [ testVertex (2/2)#0] INFO
org.apache.flink.runtime.taskmanager.Task [] - testVertex
(2/2)#0 (043f6b5353f19eee95021c9988dc2907) switched from DEPLOYING to
INITIALIZING.
22:28:32,330 [ testVertex (2/2)#0] INFO
org.apache.flink.runtime.taskmanager.Task [] - testVertex
(2/2)#0 (043f6b5353f19eee95021c9988dc2907) switched from INITIALIZING to
RUNNING.
22:28:32,331 [flink-akka.actor.default-dispatcher-3] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - testVertex
(2/2) (043f6b5353f19eee95021c9988dc2907) switched from DEPLOYING to
INITIALIZING.
22:28:32,331 [flink-akka.actor.default-dispatcher-3] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - testVertex
(2/2) (043f6b5353f19eee95021c9988dc2907) switched from INITIALIZING to RUNNING.
22:28:32,343 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 1 (type=CHECKPOINT) @ 1618871312339 for job
15f6c6d1527f3cbd07771ec998fbb38f.
22:28:32,344 [flink-akka.actor.default-dispatcher-3] INFO
org.apache.flink.runtime.jobmaster.JobMaster [] - Triggering
stop-with-savepoint for job 15f6c6d1527f3cbd07771ec998fbb38f.
22:28:32,351 [jobmanager-future-thread-10] WARN
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late
message for now expired checkpoint attempt 1 from task
0febd0493e4d0151092e5d9f1d11f8d7 of job 15f6c6d1527f3cbd07771ec998fbb38f at
0acad546-425d-4c54-81c6-318541f6fc12 @ localhost (dataPort=-1).
22:28:32,352 [jobmanager-future-thread-11] WARN
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Received late
message for now expired checkpoint attempt 1 from task
043f6b5353f19eee95021c9988dc2907 of job 15f6c6d1527f3cbd07771ec998fbb38f at
0acad546-425d-4c54-81c6-318541f6fc12 @ localhost (dataPort=-1).
22:28:32,358 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 2 (type=SAVEPOINT_TERMINATE) @ 1618871312356 for job
15f6c6d1527f3cbd07771ec998fbb38f.
22:28:32,405 [jobmanager-future-thread-7] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
checkpoint 2 for job 15f6c6d1527f3cbd07771ec998fbb38f (0 bytes in 48 ms).
22:28:32,413 [ testVertex (1/2)#0] INFO
org.apache.flink.runtime.taskmanager.Task [] - testVertex
(1/2)#0 (0febd0493e4d0151092e5d9f1d11f8d7) switched from RUNNING to FINISHED.
22:28:32,413 [ testVertex (1/2)#0] INFO
org.apache.flink.runtime.taskmanager.Task [] - Freeing task
resources for testVertex (1/2)#0 (0febd0493e4d0151092e5d9f1d11f8d7).
22:28:32,414 [flink-akka.actor.default-dispatcher-3] INFO
org.apache.flink.runtime.taskexecutor.TaskExecutor [] -
Un-registering task and sending final execution state FINISHED to JobManager
for task testVertex (1/2)#0 0febd0493e4d0151092e5d9f1d11f8d7.
22:28:32,423 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - testVertex
(1/2) (0febd0493e4d0151092e5d9f1d11f8d7) switched from RUNNING to
FINISHED.22:28:42,376 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job Unnamed
job (15f6c6d1527f3cbd07771ec998fbb38f) switched from state RUNNING to
CANCELLING.
22:28:42,377 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Discarding
the results produced by task execution 0febd0493e4d0151092e5d9f1d11f8d7.
22:28:42,378 [flink-akka.actor.default-dispatcher-2] INFO
org.apache.flink.runtime.executiongraph.ExecutionGraph [] - testVertex
(2/2) (043f6b5353f19eee95021c9988dc2907) switched from RUNNING to CANCELING.
22:28:42,382 [flink-akka.actor.default-dispatcher-6] INFO
org.apache.flink.runtime.taskmanager.Task [] - Attempting to
cancel task testVertex (2/2)#0 (043f6b5353f19eee95021c9988dc2907).
22:28:42,383 [flink-akka.actor.default-dispatcher-6] INFO
org.apache.flink.runtime.taskmanager.Task [] - testVertex
(2/2)#0 (043f6b5353f19eee95021c9988dc2907) switched from RUNNING to CANCELING.
22:28:42,383 [flink-akka.actor.default-dispatcher-6] INFO
org.apache.flink.runtime.taskmanager.Task [] - Triggering
cancellation of task code testVertex (2/2)#0 (043f6b5353f19eee95021c9988dc2907).
22:28:42,385 [ testVertex (2/2)#0] INFO
org.apache.flink.runtime.taskmanager.Task [] - testVertex
(2/2)#0 (043f6b5353f19eee95021c9988dc2907) switched from CANCELING to CANCELED.
{code}
> JobMasterStopWithSavepointITCase.terminateWithSavepointWithoutComplicationsShouldSucceedAndLeadJobToFinished
> times out
> ----------------------------------------------------------------------------------------------------------------------
>
> Key: FLINK-22367
> URL: https://issues.apache.org/jira/browse/FLINK-22367
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Dawid Wysakowicz
> Assignee: Chesnay Schepler
> Priority: Critical
> Labels: test-stability
> Fix For: 1.13.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=16818&view=logs&j=2c3cbe13-dee0-5837-cf47-3053da9a8a78&t=2c7d57b9-7341-5a87-c9af-2cf7cc1a37dc&l=3844
> {code}
> [ERROR] Tests run: 7, Failures: 0, Errors: 1, Skipped: 0, Time elapsed:
> 13.135 s <<< FAILURE! - in
> org.apache.flink.runtime.jobmaster.JobMasterStopWithSavepointITCase
> Apr 19 22:28:44 [ERROR]
> terminateWithSavepointWithoutComplicationsShouldSucceedAndLeadJobToFinished(org.apache.flink.runtime.jobmaster.JobMasterStopWithSavepointITCase)
> Time elapsed: 10.237 s <<< ERROR!
> Apr 19 22:28:44 java.util.concurrent.ExecutionException:
> java.util.concurrent.TimeoutException: Invocation of public default
> java.util.concurrent.CompletableFuture
> org.apache.flink.runtime.webmonitor.RestfulGateway.stopWithSavepoint(org.apache.flink.api.common.JobID,java.lang.String,boolean,org.apache.flink.api.common.time.Time)
> timed out.
> Apr 19 22:28:44 at
> java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395)
> Apr 19 22:28:44 at
> java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999)
> Apr 19 22:28:44 at
> org.apache.flink.runtime.jobmaster.JobMasterStopWithSavepointITCase.stopWithSavepointNormalExecutionHelper(JobMasterStopWithSavepointITCase.java:123)
> Apr 19 22:28:44 at
> org.apache.flink.runtime.jobmaster.JobMasterStopWithSavepointITCase.terminateWithSavepointWithoutComplicationsShouldSucceedAndLeadJobToFinished(JobMasterStopWithSavepointITCase.java:111)
> Apr 19 22:28:44 at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Apr 19 22:28:44 at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 19 22:28:44 at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 19 22:28:44 at
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
> Apr 19 22:28:44 at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> Apr 19 22:28:44 at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 19 22:28:44 at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> Apr 19 22:28:44 at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 19 22:28:44 at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> Apr 19 22:28:44 at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> Apr 19 22:28:44 at
> org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Apr 19 22:28:44 at
> org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
> Apr 19 22:28:44 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> Apr 19 22:28:44 at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> Apr 19 22:28:44 at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> Apr 19 22:28:44 at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> Apr 19 22:28:44 at
> org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:48)
> Apr 19 22:28:44 at org.junit.rules.RunRules.evaluate(RunRules.java:20)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> Apr 19 22:28:44 at org.junit.runners.Suite.runChild(Suite.java:128)
> Apr 19 22:28:44 at org.junit.runners.Suite.runChild(Suite.java:27)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> Apr 19 22:28:44 at
> org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.junitcore.JUnitCore.run(JUnitCore.java:55)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.createRequestAndRun(JUnitCoreWrapper.java:137)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.executeEager(JUnitCoreWrapper.java:107)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:83)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.junitcore.JUnitCoreWrapper.execute(JUnitCoreWrapper.java:75)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.junitcore.JUnitCoreProvider.invoke(JUnitCoreProvider.java:158)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> Apr 19 22:28:44 at
> org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)