[
https://issues.apache.org/jira/browse/FLINK-22003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312302#comment-17312302
]
Arvid Heise commented on FLINK-22003:
-------------------------------------
This one is really strange.
Usually, when we trigger a checkpoint, we immediately see some activity on the source:
{noformat}
22:24:51,290 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 1 (type=CHECKPOINT) @ 1616883891289 for job
bb1c23aad807944d0a54775098106574.
22:24:51,290 [SourceCoordinator-Source: source] INFO
org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase [] -
snapshotState EnumeratorState{unassignedSplits=[], numRestarts=0,
numCompletedCheckpoints=0}
22:24:51,291 [Flink Netty Server (0) Thread 0] TRACE
org.apache.flink.runtime.io.network.logger.NetworkActionsLogger [] - [Source:
source (2/5)#0 (8f1ca6eb04b6e2341c658cc0b1ac7c6c)]
PipelinedSubpartition#pollBuffer Buffer{size=38, hash=924008396} @
ResultSubpartitionInfo{partitionIdx=0, subPartitionIdx=0}
{noformat}
In this case, however, nothing happens on the source after checkpoint 11 is triggered:
{noformat}
22:24:54,694 [ Checkpoint Timer] INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
checkpoint 11 (type=CHECKPOINT) @ 1616883893885 for job
bb1c23aad807944d0a54775098106574.
22:24:54,694 [ failing-map (5/5)#4] INFO
org.apache.flink.runtime.taskmanager.Task [] - failing-map
(5/5)#4 (7c0e288b2cd57831596d58e8ce31e435) switched from CREATED to DEPLOYING.
{noformat}
Actually, checkpoint 11 should have been canceled, because not all tasks are
running (failing-map (5/5) is still switching from CREATED to DEPLOYING),
similar to this earlier attempt:
{noformat}
22:24:51,044 [ Checkpoint Timer] WARN
org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Failed to
trigger checkpoint for job bb1c23aad807944d0a54775098106574.)
org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint triggering
task Source: source (1/5) of job bb1c23aad807944d0a54775098106574 has not being
executed at the moment. Aborting checkpoint. Failure reason: Not all required
tasks are currently running.
{noformat}
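To make the expected behaviour concrete, here is a minimal sketch of the precondition that the warning above reflects: triggering is given up as soon as one required task is not RUNNING. The class, enum, and method names are hypothetical and chosen for illustration only; this is not Flink's actual CheckpointCoordinator code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration, not Flink code.
class TriggerPreconditionSketch {
    enum State { CREATED, DEPLOYING, RUNNING }

    /** Returns the name of the first task that is not RUNNING, if there is one. */
    static Optional<String> firstTaskNotRunning(Map<String, State> tasks) {
        for (Map.Entry<String, State> entry : tasks.entrySet()) {
            if (entry.getValue() != State.RUNNING) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Task names taken from the logs above; the states mirror the checkpoint 11 situation.
        Map<String, State> tasks = new LinkedHashMap<>();
        tasks.put("Source: source (1/5)", State.RUNNING);
        tasks.put("failing-map (5/5)", State.DEPLOYING);

        Optional<String> blocker = firstTaskNotRunning(tasks);
        if (blocker.isPresent()) {
            System.out.println("Abort trigger, task not running: " + blocker.get());
        } else {
            System.out.println("All tasks running, trigger checkpoint");
        }
    }
}
```

With failing-map (5/5) in DEPLOYING, a check of this kind would refuse to trigger, which is what the logs for checkpoint 11 do not show.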
I'm currently assuming that there is a race condition in the code of
FLINK-21067.
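As an illustration of the kind of race I have in mind, the sketch below shows a generic check-then-act problem: the state check passes, the task leaves RUNNING before the trigger arrives, and the trigger is silently dropped, so the checkpoint is neither taken nor canceled. All names are hypothetical, the interleaving is forced with a callback instead of real threads, and this does not claim to reproduce the actual code of FLINK-21067.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical illustration of a check-then-act race, not Flink code.
class CheckThenActRaceSketch {
    enum State { DEPLOYING, RUNNING }

    /** Task that silently ignores trigger messages unless it is RUNNING. */
    static final class Task {
        final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        final List<Long> snapshots = new ArrayList<>();

        void triggerCheckpoint(long checkpointId) {
            if (state.get() == State.RUNNING) {
                snapshots.add(checkpointId);
            }
        }
    }

    /**
     * Checks the task state, runs {@code interleaved} (standing in for a
     * concurrent restart), then sends the trigger.
     * Returns true only if the task actually took the snapshot.
     */
    static boolean checkThenTrigger(Task task, long checkpointId, Runnable interleaved) {
        if (task.state.get() != State.RUNNING) {
            return false; // canceled up front, the healthy path
        }
        interleaved.run(); // the window in which the state can change
        task.triggerCheckpoint(checkpointId);
        return task.snapshots.contains(checkpointId);
    }

    public static void main(String[] args) {
        Task stable = new Task();
        System.out.println("no interleaving: " + checkThenTrigger(stable, 1L, () -> {}));

        Task restarting = new Task();
        System.out.println("restart in between: "
                + checkThenTrigger(restarting, 11L, () -> restarting.state.set(State.DEPLOYING)));
    }
}
```

In the second call the check succeeds, yet no snapshot is taken and nothing is canceled, which would match a checkpoint that is triggered and then simply hangs until the test times out.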
> UnalignedCheckpointITCase fail
> ------------------------------
>
> Key: FLINK-22003
> URL: https://issues.apache.org/jira/browse/FLINK-22003
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Guowei Ma
> Priority: Major
> Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15601&view=logs&j=119bbba7-f5e3-5e08-e72d-09f1529665de&t=7dc1f5a9-54e1-502e-8b02-c7df69073cfc&l=4142
> {code:java}
> [ERROR] execute[parallel pipeline with remote channels, p = 5](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time elapsed: 60.018 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 60000 milliseconds
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1859)
> at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
> at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839)
> at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1822)
> at org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:138)
> at org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:184)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.lang.Thread.run(Thread.java:748)
> {code}