[ 
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135506#comment-17135506
 ] 

Yun Tang commented on FLINK-18238:
----------------------------------

Thanks to [~kevin.cyj] for investigating this.

[~pnowojski], I think this is because the current implementation does not 
emit a checkpoint barrier or CancelCheckpointMarker downstream once it finds 
that the checkpoint has been aborted (see [SubtaskCheckpointCoordinatorImpl 
code|https://github.com/apache/flink/blob/35f95f5ac02c6014cdc8ef714ca66ad7e2cfdd5b/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java#L240-L252]).

I compared the current Flink implementation with our internal Blink and 
noticed that Blink still broadcasts the barrier for an aborted checkpoint and 
then returns, skipping the follow-up sync and async phases. I think that's why 
we did not hit this problem internally.
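
To make the difference concrete, here is a minimal sketch of the two 
behaviours. It is not the actual Flink or Blink code: the class 
AbortedCheckpointSketch, its Output helper, the broadcastOnAbort flag and the 
string events are all hypothetical names invented for illustration, and the 
snapshot phases are reduced to a comment.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model; NOT Flink's SubtaskCheckpointCoordinatorImpl. */
final class AbortedCheckpointSketch {

    /** Stand-in for the task output: records what is sent downstream. */
    static final class Output {
        final List<String> events = new ArrayList<>();

        void broadcast(String event) {
            events.add(event);
        }
    }

    private final Set<Long> abortedCheckpointIds = new HashSet<>();
    private final Output output;
    // Assumption: true models the Blink-like behaviour described above,
    // false models returning without emitting anything downstream.
    private final boolean broadcastOnAbort;

    AbortedCheckpointSketch(Output output, boolean broadcastOnAbort) {
        this.output = output;
        this.broadcastOnAbort = broadcastOnAbort;
    }

    /** Called when the task learns that a checkpoint was aborted. */
    void notifyAborted(long checkpointId) {
        abortedCheckpointIds.add(checkpointId);
    }

    /** Handles a barrier; returns true only if the snapshot phases would run. */
    boolean onBarrier(long checkpointId) {
        if (abortedCheckpointIds.remove(checkpointId)) {
            if (broadcastOnAbort) {
                // Downstream still sees the barrier and is not left waiting.
                output.broadcast("barrier-" + checkpointId);
            }
            return false; // skip the sync and async phases either way
        }
        output.broadcast("barrier-" + checkpointId);
        // The sync and async snapshot phases would run here.
        return true;
    }

    public static void main(String[] args) {
        Output out = new Output();
        AbortedCheckpointSketch task = new AbortedCheckpointSketch(out, true);
        task.notifyAborted(1L);
        task.onBarrier(1L);
        System.out.println(out.events);
    }
}
```

With broadcastOnAbort set to false, nothing is recorded for the aborted 
checkpoint, so a downstream task that is aligning on that barrier could wait 
for it indefinitely. This is my reading of the deadlock, not a confirmed root 
cause.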

> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
>                 Key: FLINK-18238
>                 URL: https://issues.apache.org/jira/browse/FLINK-18238
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Piotr Nowojski
>            Assignee: Yingjie Cao
>            Priority: Blocker
>             Fix For: 1.11.0
>
>         Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days 
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the 
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
