[
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135506#comment-17135506
]
Yun Tang commented on FLINK-18238:
----------------------------------
Thanks to [~kevin.cyj] for investigating this.
[~pnowojski], I think this happens because the current implementation does not
emit a checkpoint barrier or CancelCheckpointMarker downstream if it finds that the
checkpoint has been aborted (see [SubtaskCheckpointCoordinatorImpl
code|https://github.com/apache/flink/blob/35f95f5ac02c6014cdc8ef714ca66ad7e2cfdd5b/flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java#L240-L252]).
I compared the current Flink implementation with our internal Blink and
noticed that Blink broadcasts the barrier first and then returns, skipping the
follow-up sync and async phases. I think that's why we did not hit this problem
internally.
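To make the difference concrete, here is a minimal toy sketch of the two behaviours described above. It is not the actual Flink or Blink code: the class AbortedCheckpointSketch, its methods, and the string-based "barrier" records are all hypothetical names invented for illustration. It only models the ordering question: whether anything is sent downstream before the task returns for an already-aborted checkpoint.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these names are Flink or Blink APIs.
class AbortedCheckpointSketch {
    private final Set<Long> abortedCheckpoints = new HashSet<>();
    private final List<String> emittedDownstream = new ArrayList<>();
    private final List<Long> snapshotsTaken = new ArrayList<>();
    // true  = broadcast the barrier, then return (behaviour described for Blink)
    // false = return without emitting anything (behaviour described for Flink)
    private final boolean broadcastBeforeReturn;

    AbortedCheckpointSketch(boolean broadcastBeforeReturn) {
        this.broadcastBeforeReturn = broadcastBeforeReturn;
    }

    // The abort notification arrives before the barrier is processed.
    void notifyAborted(long checkpointId) {
        abortedCheckpoints.add(checkpointId);
    }

    void onBarrier(long checkpointId) {
        if (abortedCheckpoints.contains(checkpointId)) {
            if (broadcastBeforeReturn) {
                // Downstream still sees a barrier, so it is not left waiting.
                emittedDownstream.add("barrier-" + checkpointId);
            }
            return; // skip the sync and async snapshot phases either way
        }
        emittedDownstream.add("barrier-" + checkpointId);
        snapshotsTaken.add(checkpointId); // stands in for sync + async phases
    }

    List<String> emittedDownstream() {
        return emittedDownstream;
    }

    List<Long> snapshotsTaken() {
        return snapshotsTaken;
    }
}
```

In this model, an instance created with false emits nothing downstream for an aborted checkpoint, which is the situation where a downstream task waiting for that barrier could block. An instance created with true still forwards the barrier and takes no snapshot.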
> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
> Key: FLINK-18238
> URL: https://issues.apache.org/jira/browse/FLINK-18238
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0
> Reporter: Piotr Nowojski
> Assignee: Yingjie Cao
> Priority: Blocker
> Fix For: 1.11.0
>
> Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/