[jira] [Commented] (FLINK-18238) RemoteChannelThroughputBenchmark deadlocks

Piotr Nowojski (Jira) Tue, 16 Jun 2020 03:23:17 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136511#comment-17136511
 ]


Piotr Nowojski commented on FLINK-18238:
----------------------------------------

Copying the result of an online discussion: we decided to broadcast 
checkpoint cancellation markers from 
{{SubtaskCheckpointCoordinatorImpl#checkpointState}} when the 
{{notifyCheckpointAborted}} RPC call was received before the checkpoint was 
triggered. This guarantees that downstream tasks will always eventually stop 
the alignment. 
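
To make the decision above concrete, here is a minimal sketch in plain Java. 
It is not the actual Flink code: the class {{AbortBeforeTriggerSketch}}, its 
nested {{SubtaskCoordinator}}, the string-based "broadcast" log and the 
{{lastTriggeredId}} bookkeeping are all hypothetical names made up for 
illustration. Only the two method names, {{notifyCheckpointAborted}} and 
{{checkpointState}}, mirror the ones mentioned in this comment, and their 
signatures here are simplified assumptions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of the agreed behaviour; not Flink's real classes. */
final class AbortBeforeTriggerSketch {

    /** Stand-in for a subtask-level checkpoint coordinator (assumed shape). */
    static final class SubtaskCoordinator {
        // Checkpoints whose abort notification arrived before they were triggered.
        private final Set<Long> abortedBeforeTrigger = new HashSet<>();
        // Record of what this subtask sent downstream, in order.
        private final List<String> broadcastLog = new ArrayList<>();
        private long lastTriggeredId = -1L;

        /** Models the abort RPC; it may arrive before the trigger for the same id. */
        void notifyCheckpointAborted(long checkpointId) {
            if (checkpointId > lastTriggeredId) {
                abortedBeforeTrigger.add(checkpointId);
            }
            // Simplification: an abort for an already triggered checkpoint is ignored here.
        }

        /** Models the trigger path: barrier normally, cancellation marker if already aborted. */
        void checkpointState(long checkpointId) {
            lastTriggeredId = Math.max(lastTriggeredId, checkpointId);
            if (abortedBeforeTrigger.remove(checkpointId)) {
                // Downstream tasks may already be aligning on this checkpoint because
                // of barriers from other upstream subtasks; the marker lets them stop.
                broadcastLog.add("CANCEL:" + checkpointId);
                return;
            }
            broadcastLog.add("BARRIER:" + checkpointId);
        }

        List<String> broadcastLog() {
            return new ArrayList<>(broadcastLog);
        }
    }

    public static void main(String[] args) {
        SubtaskCoordinator coordinator = new SubtaskCoordinator();
        coordinator.notifyCheckpointAborted(7L); // abort arrives first
        coordinator.checkpointState(7L);         // trigger arrives late
        coordinator.checkpointState(8L);         // regular checkpoint
        System.out.println(coordinator.broadcastLog());
    }
}
```

The point of the sketch is only the ordering problem: without the early-abort 
set, the late trigger for checkpoint 7 would either emit a barrier for an 
already aborted checkpoint or emit nothing, and in the latter case a 
downstream task that had started aligning would keep waiting.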

We could further optimise the process by cancelling the task's ongoing 
alignment as soon as it receives the {{notifyCheckpointAborted}} RPC, but that 
would require more extensive changes that we do not need to make right now.

> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
>                 Key: FLINK-18238
>                 URL: https://issues.apache.org/jira/browse/FLINK-18238
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Piotr Nowojski
>            Assignee: Yingjie Cao
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>         Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days 
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the 
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
