[jira] [Commented] (FLINK-18238) RemoteChannelThroughputBenchmark deadlocks

Zhijiang (Jira) Sun, 14 Jun 2020 20:25:04 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135372#comment-17135372
 ]


Zhijiang commented on FLINK-18238:
----------------------------------

[~kevin.cyj] has already identified the root cause, which was introduced by 
FLINK-8871. 

When the checkpoint coordinator receives the RPC reporting an aborted 
checkpoint from one task, it sends an abort RPC call to all the remaining 
tasks so that they do not perform a useless checkpoint.

If the respective checkpoint has not yet been performed on the 
SubtaskCheckpointCoordinatorImpl side, the task stores the aborted id so that 
it can exit the later checkpoint immediately. But the downstream side is still 
waiting for barrier alignment and can no longer receive any checkpoint barrier 
or CancelCheckpointMarker from this aborted upstream. This causes gradually 
increasing backpressure until the job deadlocks completely.
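
To make the stuck alignment concrete, here is a minimal toy model of the 
downstream side. It is only a sketch and not Flink's actual code: the class 
AlignmentSketch and all of its methods are hypothetical names invented for 
illustration, and real alignment logic is considerably more involved.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of downstream barrier alignment. Hypothetical, not a Flink class. */
public class AlignmentSketch {
    private final int numChannels;
    private final Set<Integer> blockedChannels = new HashSet<>();
    private boolean aligning;

    public AlignmentSketch(int numChannels) {
        this.numChannels = numChannels;
    }

    /** A barrier arrived: block that channel until every channel has delivered one. */
    public void onBarrier(int channel) {
        aligning = true;
        blockedChannels.add(channel);
        if (blockedChannels.size() == numChannels) {
            release();
        }
    }

    /** A cancel marker on any channel ends the alignment immediately. */
    public void onCancelMarker(int channel) {
        release();
    }

    private void release() {
        blockedChannels.clear();
        aligning = false;
    }

    public boolean isAligning() {
        return aligning;
    }

    public boolean isBlocked(int channel) {
        return blockedChannels.contains(channel);
    }
}
```

With two input channels, calling onBarrier(0) leaves channel 0 blocked. If the 
upstream behind channel 1 has stored the aborted id and skips the checkpoint 
without emitting anything, neither onBarrier(1) nor onCancelMarker(1) is ever 
called, so channel 0 stays blocked and backpressure builds up.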

One possible option is to broadcast a CancelCheckpointMarker to the downstream 
side when the upstream has already aborted the checkpoint via the RPC call, so 
that the downstream can end the alignment immediately. One side effect is that 
the CancelCheckpointMarker might be broadcast twice: once via the RPC trigger 
and once via the data stream in CheckpointBarrierHandler.
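
If the double broadcast turned out to be a problem, one conceivable way to 
suppress the second marker is to remember which checkpoint ids have already 
been cancelled. The sketch below only illustrates that idea; CancelBroadcastGuard 
and shouldBroadcast are hypothetical names, not existing Flink APIs, and this is 
not necessarily how the issue will be fixed.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical guard so that each checkpoint id is cancelled downstream at most once. */
public class CancelBroadcastGuard {
    // Ids for which a cancel marker has already been broadcast.
    // Assumption: a real implementation would prune old ids; this sketch never does.
    private final Set<Long> alreadyCancelled = new HashSet<>();

    /** Returns true only the first time it is called for a given checkpoint id. */
    public synchronized boolean shouldBroadcast(long checkpointId) {
        return alreadyCancelled.add(checkpointId);
    }
}
```

Both the RPC path and the data-stream path would consult the guard before 
broadcasting, so whichever path arrives second becomes a no-op.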

> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
>                 Key: FLINK-18238
>                 URL: https://issues.apache.org/jira/browse/FLINK-18238
>             Project: Flink
>          Issue Type: Bug
>          Components: Benchmarks, Runtime / Network
>    Affects Versions: 1.11.0
>            Reporter: Piotr Nowojski
>            Assignee: Yingjie Cao
>            Priority: Blocker
>             Fix For: 1.11.0
>
>         Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days 
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the 
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
