[
https://issues.apache.org/jira/browse/FLINK-22003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312877#comment-17312877
]
Yun Gao edited comment on FLINK-22003 at 4/6/21, 9:22 AM:
----------------------------------------------------------
Hi, I think [~roman_khachatryan] is exactly right for the conditions of
successfully triggering, but I read the log and found there might be still some
issues since the source task 3/5, 4/5 and 5/5 are transited to RUNNING in JM
side after the checkpoint 11 is triggered.
After analyzed the log, I think it is very likely due to that when computing
the tasks to trigger, the failover is still not done yet, thus some sources
might acquired and checked against their prior execution. And it seems
FLINK-21067 indeed has problem in that it ignores the tasks in
CANCELING/CANCELED state, which is different from the prior logic. If this case
happens, the prior execution would fail triggering and cause the checkpoint
would have to wait for timeout to be canceled. Very sorry for introducing the
bug and I'll open a PR for it today to change the condition to only allows for
RUNNING status (Although we would also trigger checkpoint after tasks finished,
but the finished tasks should not be in the list of tasks to trigger).
But it also comes to me that there _might be_ another case: when we compute
tasks to trigger and checking status, the failover might not happen yet, then
all the tasks are running and then the checkpoint would be triggered. The
triggering would also fail due to the prior execution is gone and cause the
checkpoint to wait for timeout. This behavior should already exists before
FLINK-21067 if the case did exist. I'll have a double confirmation about this
case.
was (Author: gaoyunhaii):
Hi, I think [~roman_khachatryan] is exactly right for the conditions of
successfully triggering, but I read the log and found there might be still some
issues since the source task 3/5, 4/5 and 5/5 are transited to RUNNING in JM
side after the checkpoint 11 is triggered.
After analyzed the log, I think it is very likely due to that when computing
the tasks to trigger, the failover is still not done yet, thus some sources
might acquired and checked against their prior execution. And it seems
FLINK-21067 indeed has problem in that it ignores the tasks in
CANCELING/CANCELED state, which is different from the prior logic. If this case
happens, the prior execution would fail triggering and cause the checkpoint
would have to wait for timeout to be canceled. Very sorry for introducing the
bug and I'll open a PR for it today to change the condition to only allows for
RUNNING and FINISHED status.
But it also comes to me that there _might be_ another case: when we compute
tasks to trigger and checking status, the failover might not happen yet, then
all the tasks are running and then the checkpoint would be triggered. The
triggering would also fail due to the prior execution is gone and cause the
checkpoint to wait for timeout. This behavior should already exists before
FLINK-21067 if the case did exist. I'll have a double confirmation about this
case.
> UnalignedCheckpointITCase fail
> ------------------------------
>
> Key: FLINK-22003
> URL: https://issues.apache.org/jira/browse/FLINK-22003
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.13.0
> Reporter: Guowei Ma
> Assignee: Roman Khachatryan
> Priority: Major
> Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15601&view=logs&j=119bbba7-f5e3-5e08-e72d-09f1529665de&t=7dc1f5a9-54e1-502e-8b02-c7df69073cfc&l=4142
> {code:java}
> [ERROR] execute[parallel pipeline with remote channels, p =
> 5](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase) Time
> elapsed: 60.018 s <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 60000
> milliseconds
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
> at
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> at
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
> at
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> at
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1859)
> at
> org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
> at
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839)
> at
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1822)
> at
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:138)
> at
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:184)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
> at
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.lang.Thread.run(Thread.java:748)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)