[
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135372#comment-17135372
]
Zhijiang commented on FLINK-18238:
----------------------------------
[~kevin.cyj] has already identified the root cause, which was introduced by
FLINK-8871.
When the checkpoint coordinator receives a checkpoint abort RPC from one
task, it sends the abort RPC to all the remaining tasks so that they do not
perform a useless checkpoint.
If the respective checkpoint has not yet been performed on the
SubtaskCheckpointCoordinatorImpl side, it stores the aborted id so that the
later checkpoint exits directly. But the downstream side still waits for
barrier alignment and can no longer receive any checkpoint barrier or
CancelCheckpointMarker from this aborted upstream. This gradually builds up
backpressure until the job deadlocks completely.
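To make the sequence above concrete, here is a minimal toy model of it. The
classes UpstreamSubtask and DownstreamAligner are hypothetical and are not
Flink's real classes; the sketch assumes a downstream with two input channels
and an abort RPC that reaches one upstream before its checkpoint trigger.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only: these classes are hypothetical, not Flink's real implementation. */
final class AlignmentDeadlockSketch {

    /** Upstream subtask that remembers checkpoint ids aborted before they were performed. */
    static final class UpstreamSubtask {
        private final Set<Long> abortedIds = new HashSet<>();

        void notifyCheckpointAborted(long checkpointId) {
            abortedIds.add(checkpointId);
        }

        /** Returns true if a barrier is emitted downstream, false if the checkpoint is skipped. */
        boolean triggerCheckpoint(long checkpointId) {
            // remove() returns true when the id was stored as aborted: exit without emitting anything.
            return !abortedIds.remove(checkpointId);
        }
    }

    /** Downstream aligner that blocks a channel after its barrier until every channel delivered one. */
    static final class DownstreamAligner {
        private final int numChannels;
        private final Set<Integer> blockedChannels = new HashSet<>();

        DownstreamAligner(int numChannels) {
            this.numChannels = numChannels;
        }

        void onBarrier(int channel) {
            blockedChannels.add(channel);
            if (blockedChannels.size() == numChannels) {
                blockedChannels.clear(); // alignment complete, all channels resume
            }
        }

        boolean isAligning() {
            return !blockedChannels.isEmpty();
        }

        boolean isBlocked(int channel) {
            return blockedChannels.contains(channel);
        }
    }

    public static void main(String[] args) {
        UpstreamSubtask up0 = new UpstreamSubtask();
        UpstreamSubtask up1 = new UpstreamSubtask();
        DownstreamAligner down = new DownstreamAligner(2);

        up1.notifyCheckpointAborted(7L); // abort RPC arrives before the trigger
        if (up0.triggerCheckpoint(7L)) down.onBarrier(0);
        if (up1.triggerCheckpoint(7L)) down.onBarrier(1); // skipped: nothing is sent

        System.out.println("still aligning: " + down.isAligning());    // prints "still aligning: true"
        System.out.println("channel 0 blocked: " + down.isBlocked(0)); // prints "channel 0 blocked: true"
    }
}
```

In this model channel 0 stays blocked forever, because channel 1 never
delivers a barrier or a cancel marker for checkpoint 7.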
One possible option is to broadcast a CancelCheckpointMarker to the downstream
side when the upstream has already aborted the checkpoint via the RPC call, so
that the downstream can end the alignment immediately. But one side effect is
that the CancelCheckpointMarker might be broadcast twice: once via the RPC
trigger and once via the data stream in CheckpointBarrierHandler.
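The following is a rough sketch of how a downstream could tolerate a cancel
marker that arrives twice, assuming it tracks the latest cancelled checkpoint
id and ignores repeats. DownstreamAligner, onBarrier and onCancelMarker are
hypothetical names for illustration. This deduplication is an assumption and
not necessarily how CheckpointBarrierHandler behaves or how the issue was
finally fixed.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only: hypothetical names, not Flink's real CheckpointBarrierHandler. */
final class CancelMarkerSketch {

    static final class DownstreamAligner {
        private final int numChannels;
        private final Set<Integer> blockedChannels = new HashSet<>();
        private long currentId = -1L;
        private long latestCancelledId = -1L;
        private int abortCount = 0;

        DownstreamAligner(int numChannels) {
            this.numChannels = numChannels;
        }

        void onBarrier(int channel, long checkpointId) {
            if (checkpointId <= latestCancelledId || checkpointId < currentId) {
                return; // barrier of a cancelled or outdated checkpoint: ignore
            }
            if (checkpointId > currentId) {
                currentId = checkpointId; // a newer checkpoint starts a fresh alignment
                blockedChannels.clear();
            }
            blockedChannels.add(channel);
            if (blockedChannels.size() == numChannels) {
                blockedChannels.clear(); // alignment complete
            }
        }

        void onCancelMarker(long checkpointId) {
            if (checkpointId <= latestCancelledId) {
                return; // duplicate marker (e.g. RPC path and data-stream path): no-op
            }
            latestCancelledId = checkpointId;
            if (checkpointId >= currentId) {
                blockedChannels.clear(); // end the alignment immediately
            }
            abortCount++;
        }

        boolean isAligning() {
            return !blockedChannels.isEmpty();
        }

        int abortCount() {
            return abortCount;
        }
    }

    public static void main(String[] args) {
        DownstreamAligner down = new DownstreamAligner(2);
        down.onBarrier(0, 7L);   // channel 0 delivered its barrier and is blocked
        down.onCancelMarker(7L); // marker broadcast after the RPC abort on the upstream
        down.onCancelMarker(7L); // same marker arriving again via the data stream

        System.out.println("aligning: " + down.isAligning());     // prints "aligning: false"
        System.out.println("aborts handled: " + down.abortCount()); // prints "aborts handled: 1"
    }
}
```

With this kind of check the second marker is harmless, so the double
broadcast only costs one extra event on the wire.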
> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
> Key: FLINK-18238
> URL: https://issues.apache.org/jira/browse/FLINK-18238
> Project: Flink
> Issue Type: Bug
> Components: Benchmarks, Runtime / Network
> Affects Versions: 1.11.0
> Reporter: Piotr Nowojski
> Assignee: Yingjie Cao
> Priority: Blocker
> Fix For: 1.11.0
>
> Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/