[ 
https://issues.apache.org/jira/browse/FLINK-22003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17312877#comment-17312877
 ] 

Yun Gao commented on FLINK-22003:
---------------------------------

Hi, I think [~roman_khachatryan] is exactly right for the conditions of 
successfully triggering, but I read the log and found there might be still some 
issues since the source task 3/5, 4/5 and 5/5 are transited to RUNNING in JM 
side after the checkpoint 11 is triggered. 

After analyzed the log, I think it is very likely due to that when computing 
the tasks to trigger, the failover is still not done yet, thus some sources 
might acquired and checked against their prior execution. And it seems 
FLINK-21067 indeed has problem in that it ignores the tasks in 
CANCELING/CANCELED state, which is different from the prior logic. If this case 
happens, the prior execution would fail triggering and cause the checkpoint 
would have to wait for timeout to be canceled. Very sorry for introducing the 
bug and I'll open a PR for it today to change the condition to only allows for 
RUNNING and FINISHED status. 

But it also comes to me that there _might be_ another case: when we compute 
tasks to trigger and checking status, the failover might not happen yet, then 
all the tasks are running and then the checkpoint would be triggered. The 
triggering would also fail due to the prior execution is gone and cause the 
checkpoint to wait for timeout. This behavior should already exists before  
FLINK-21067 if the case did exist. I'll have a double confirmation about this 
case.

> UnalignedCheckpointITCase fail
> ------------------------------
>
>                 Key: FLINK-22003
>                 URL: https://issues.apache.org/jira/browse/FLINK-22003
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.13.0
>            Reporter: Guowei Ma
>            Priority: Major
>              Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=15601&view=logs&j=119bbba7-f5e3-5e08-e72d-09f1529665de&t=7dc1f5a9-54e1-502e-8b02-c7df69073cfc&l=4142
> {code:java}
> [ERROR] execute[parallel pipeline with remote channels, p = 
> 5](org.apache.flink.test.checkpointing.UnalignedCheckpointITCase)  Time 
> elapsed: 60.018 s  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 60000 
> milliseconds
>       at sun.misc.Unsafe.park(Native Method)
>       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>       at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>       at 
> java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>       at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>       at 
> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1859)
>       at 
> org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1839)
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1822)
>       at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointTestBase.execute(UnalignedCheckpointTestBase.java:138)
>       at 
> org.apache.flink.test.checkpointing.UnalignedCheckpointITCase.execute(UnalignedCheckpointITCase.java:184)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
>       at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to