[jira] [Comment Edited] (FLINK-18238) RemoteChannelThroughputBenchmark deadlocks

Piotr Nowojski (Jira) Mon, 15 Jun 2020 04:59:28 -0700


    [ https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135790#comment-17135790 ]


Piotr Nowojski edited comment on FLINK-18238 at 6/15/20, 11:58 AM:
-------------------------------------------------------------------

[~yunta] as I wrote above, broadcasting {{CancelCheckpointMarker}} downstream
can probably cause earlier checkpoints to fail that would otherwise complete
successfully.

The current semantics of {{CancelCheckpointMarker(N)}} is that it aborts all
checkpoints <= N. To preserve the previous behaviour, we would need more
complicated logic, either in processing {{CancelCheckpointMarker}} or in
broadcasting it (we could postpone broadcasting).
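
To illustrate the "aborts all checkpoints <= N" semantics, here is a minimal
sketch. It is a simplified model, not Flink's actual code: the class
{{PendingCheckpoints}} and its methods {{start}}, {{cancelUpTo}} and
{{pendingIds}} are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical model of a task's pending checkpoints; not a Flink class.
final class PendingCheckpoints {
    // IDs of checkpoints that have started but not yet completed, kept sorted.
    private final TreeSet<Long> pending = new TreeSet<>();

    // Registers a checkpoint as pending.
    void start(long checkpointId) {
        pending.add(checkpointId);
    }

    // Models receiving CancelCheckpointMarker(n): every pending checkpoint
    // with id <= n is aborted, not only checkpoint n itself.
    List<Long> cancelUpTo(long n) {
        NavigableSet<Long> affected = pending.headSet(n, true); // live view, ids <= n
        List<Long> aborted = new ArrayList<>(affected);
        affected.clear(); // clearing the view removes the ids from 'pending'
        return aborted;
    }

    // Returns the checkpoints that are still pending, in ascending order.
    List<Long> pendingIds() {
        return new ArrayList<>(pending);
    }
}
```

In this model, if checkpoints 1, 2 and 3 are pending and a marker for
checkpoint 2 arrives, checkpoint 1 is aborted as well, even though it might
have completed successfully on its own. That is the side effect of
broadcasting the marker downstream described above.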

Maybe it's fine to change the previous behaviour, but that wasn't the intention
of FLINK-8871.

Either way, I think it would be safer to revert this change and redo it on the
master branch for 1.12, as who knows what other problems can pop up even in the
simplest solution:
{quote}
the clear solution is to broadcast CancelCheckpointMarker downside. 
{quote} 
and we are now blocking the 1.11 release.
{quote}
since the SubtaskCheckpointCoordinator can not access the component of 
CheckpointBarrierHandler directly
{quote}
[~zjwang] this would be easy to solve, assuming that we want to go in this
direction.


> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
>                 Key: FLINK-18238
>                 URL: https://issues.apache.org/jira/browse/FLINK-18238
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Piotr Nowojski
>            Assignee: Yingjie Cao
>            Priority: Blocker
>             Fix For: 1.11.0
>
>         Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days 
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the 
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/
