[
https://issues.apache.org/jira/browse/FLINK-18238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135534#comment-17135534
]
Piotr Nowojski edited comment on FLINK-18238 at 6/15/20, 7:34 AM:
------------------------------------------------------------------
{quote}
The downstream task can receive the abort rpc call from coordinator, but it can
not touch the `CheckpointBarrierHandler` to end the alignment and it only works
on `StreamTask` with sub task coordinator.
{quote}
Ok, I get it. I think it would be OK to cancel the alignment directly from the
{{SubtaskCheckpointCoordinator}}, but as I wrote above, it might open up race
conditions with tasks that have not yet started.
{quote}
One possible option is to broadcast CancelCheckpointMarker to downstream side
when the upstream already aborted the checkpoint from RPC call, then the
downstream can end the alignment immediately. But one side effect is that the
CancelCheckpointMarker might be broadcasted twice, one is via RPC trigger and
anther is via data stream in CheckpointBarrierHandler.
{quote}
This could be handled by checking whether the checkpoint has already been
aborted.
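To make the double-broadcast concrete, here is a minimal sketch of such a deduplication check. The class and method names here are purely illustrative assumptions, not the actual Flink API: the point is only that the first abort notification (whether it arrives via RPC or via the data stream) wins, and the second one becomes a no-op.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: deduplicate abort notifications that may arrive twice,
// once via the RPC trigger and once via the in-band CancelCheckpointMarker.
class AbortDeduplicator {
    private final Set<Long> abortedCheckpoints = new HashSet<>();

    /** Returns true only for the first abort of a given checkpoint id. */
    boolean markAborted(long checkpointId) {
        return abortedCheckpoints.add(checkpointId);
    }
}

class DedupDemo {
    public static void main(String[] args) {
        AbortDeduplicator dedup = new AbortDeduplicator();
        System.out.println(dedup.markAborted(42L)); // first abort (e.g. via RPC): true
        System.out.println(dedup.markAborted(42L)); // duplicate via data stream: false
    }
}
```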
But now that I think about it, what if there are multiple ongoing checkpoints
in the job graph: N, N+1, ..., N+5? Say checkpoint N+5 fails somewhere at the
head of the job graph, while the others are still flowing through it. Without
the abort RPC call, if a task completed checkpoints N, ..., N+4 (broadcasting
the checkpoint barriers for them) and then failed for checkpoint N+5, the
cancellation markers from this task wouldn't be processed by downstream tasks
that are still waiting for alignment on checkpoints N, ..., N+4 (because of
alignment and blocked channels). So checkpoints N, ..., N+4 could still
complete normally.
With the abort RPC call, cancellations can overtake the pending checkpoint
barriers, so in the scenario above we would cancel all checkpoints from N to
N+5. I'm not sure whether this can happen on master as it is, without
broadcasting {{CancelCheckpointMarker}}, but I think it could happen with
broadcasting.
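To illustrate the "overtaking" effect, here is a hypothetical sketch (the names are illustrative, not the real {{CheckpointBarrierHandler}} API): when an out-of-band abort for checkpoint C arrives, it also ends the alignment of every pending checkpoint with an id up to C, because their in-band barriers and cancellation markers are stuck behind blocked (already aligned) channels.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of the overtaking scenario: an out-of-band abort for
// checkpoint C cancels every pending alignment with id <= C, since the
// in-band markers for those checkpoints would never get past blocked channels.
class AlignmentTracker {
    private final TreeSet<Long> pendingAlignments = new TreeSet<>();

    void startAlignment(long checkpointId) {
        pendingAlignments.add(checkpointId);
    }

    /** Aborts checkpoint C and all earlier pending alignments; returns what was cancelled. */
    List<Long> abortUpTo(long checkpointId) {
        List<Long> cancelled = new ArrayList<>(pendingAlignments.headSet(checkpointId, true));
        pendingAlignments.removeAll(cancelled);
        return cancelled;
    }
}

class OvertakeDemo {
    public static void main(String[] args) {
        AlignmentTracker tracker = new AlignmentTracker();
        for (long id = 1; id <= 5; id++) { // checkpoints N..N+4 still aligning downstream
            tracker.startAlignment(id);
        }
        // RPC abort for checkpoint N+5 (id 6) overtakes the in-band barriers:
        System.out.println(tracker.abortUpTo(6L)); // prints [1, 2, 3, 4, 5]
    }
}
```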
> RemoteChannelThroughputBenchmark deadlocks
> ------------------------------------------
>
> Key: FLINK-18238
> URL: https://issues.apache.org/jira/browse/FLINK-18238
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.11.0
> Reporter: Piotr Nowojski
> Assignee: Yingjie Cao
> Priority: Blocker
> Fix For: 1.11.0
>
> Attachments: consoleText_remote_benchmark_deadlock.txt
>
>
> In the last couple of days
> {{RemoteChannelThroughputBenchmark.remoteRebalance}} deadlocked for the
> second time:
> http://codespeed.dak8s.net:8080/job/flink-master-benchmarks/6019/
--
This message was sent by Atlassian Jira
(v8.3.4#803005)