[
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135534#comment-17135534
]
Piotr Nowojski edited comment on FLINK-18238 at 6/15/20, 7:34 AM:
------------------------------------------------------------------
{quote}
The downstream task can receive the abort rpc call from coordinator, but it can
not touch the `CheckpointBarrierHandler` to end the alignment and it only works
on `StreamTask` with sub task coordinator.
{quote}
Ok, I get it. I think it would be OK to cancel the alignment directly from the
{{SubtaskCheckpointCoordinator}}, but as I wrote above, it might open up race
conditions with tasks that have not yet started.
{quote}
One possible option is to broadcast CancelCheckpointMarker to downstream side
when the upstream already aborted the checkpoint from RPC call, then the
downstream can end the alignment immediately. But one side effect is that the
CancelCheckpointMarker might be broadcasted twice, one is via RPC trigger and
anther is via data stream in CheckpointBarrierHandler.
{quote}
This could be handled by checking whether the checkpoint has already been
aborted.
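To make the double-broadcast concrete, here is a minimal sketch of such a deduplication check. The class and method names here are purely illustrative assumptions, not the actual Flink API: the point is only that the first abort notification (whether it arrives via RPC or via the data stream) wins, and the second one becomes a no-op.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: deduplicate abort notifications that may arrive twice,
// once via the RPC trigger and once via the in-band CancelCheckpointMarker.
class AbortDeduplicator {
    private final Set<Long> abortedCheckpoints = new HashSet<>();

    /** Returns true only for the first abort of a given checkpoint id. */
    boolean markAborted(long checkpointId) {
        return abortedCheckpoints.add(checkpointId);
    }
}

class DedupDemo {
    public static void main(String[] args) {
        AbortDeduplicator dedup = new AbortDeduplicator();
        System.out.println(dedup.markAborted(42L)); // first abort (e.g. via RPC): true
        System.out.println(dedup.markAborted(42L)); // duplicate via data stream: false
    }
}
```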
But now that I think about it, what if there are multiple ongoing checkpoints
in the job graph: N, N+1, ..., N+5? Say checkpoint N+5 fails somewhere at the
head of the job graph, while the others are still flowing through it. Without
the abort RPC call, if a task completed checkpoints N, ..., N+4 (broadcasting
the checkpoint barriers for them) and then failed for checkpoint N+5, the
cancellation markers from this task wouldn't be processed by downstream tasks
that are still waiting for alignment on checkpoints N, ..., N+4 (because of
alignment and blocked channels). So checkpoints N, ..., N+4 could still
complete normally.
With the abort RPC call, cancellations can overtake the pending checkpoint
barriers, so in the scenario above we would cancel all checkpoints from N to
N+5. I'm not sure whether this can happen on master as it is, without
broadcasting {{CancelCheckpointMarker}}, but I think it could happen with
broadcasting.
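To illustrate the "overtaking" effect, here is a hypothetical sketch (the names are illustrative, not the real {{CheckpointBarrierHandler}} API): when an out-of-band abort for checkpoint C arrives, it also ends the alignment of every pending checkpoint with an id up to C, because their in-band barriers and cancellation markers are stuck behind blocked (already aligned) channels.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the overtaking scenario: an out-of-band abort for
// checkpoint C cancels every pending alignment with id <= C, since the
// in-band markers for those checkpoints would never get past blocked channels.
class AlignmentTracker {
    private final TreeSet<Long> pendingAlignments = new TreeSet<>();

    void startAlignment(long checkpointId) {
        pendingAlignments.add(checkpointId);
    }

    /** Aborts checkpoint C and all earlier pending alignments; returns what was cancelled. */
    List<Long> abortUpTo(long checkpointId) {
        List<Long> cancelled = new ArrayList<>(pendingAlignments.headSet(checkpointId, true));
        pendingAlignments.removeAll(cancelled);
        return cancelled;
    }
}

class OvertakeDemo {
    public static void main(String[] args) {
        AlignmentTracker tracker = new AlignmentTracker();
        for (long id = 1; id <= 5; id++) { // checkpoints N..N+4 still aligning downstream
            tracker.startAlignment(id);
        }
        // RPC abort for checkpoint N+5 (id 6) overtakes the in-band barriers:
        System.out.println(tracker.abortUpTo(6L)); // prints [1, 2, 3, 4, 5]
    }
}
```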
> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
> Key: FLINK-18238
> URL: https://issues.apache.org/jira/browse/FLINK-18238
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0
> Reporter: Piotr Nowojski
> Assignee: Yingjie Cao
> Priority: Blocker
> Fix For: 1.11.0
>
> Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/
--
This message was sent by Atlassian Jira
(v8.3.4#803005)