[
https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863760#comment-16863760
]
Piotr Nowojski commented on FLINK-8871:
---------------------------------------
Thanks for bringing this to my attention [~yunta]. Unfortunately there are some
more conflicts in the pipeline. Probably nothing that is completely
incompatible but we are working on the same classes. In order to finalize
[~sunhaibotb]'s `TwoInputStreamSelectable`, I have to rework
`CheckpointBarrierHandler` implementation quite a bit. I two biggest sources of
conflicts:
1. I'm splitting the existing code of {{BarrierBuffer}} into two separate
classes - I'm moving the code that actually performs the alignment to a
separate class {{CheckpointBarrierAligner}}, leaving {{BarrierBuffer}} a quite
thin class. This is in order to support two separate instances of
{{BarrierBuffer}} for {{TwoInputStreamOperator}} - that means the code tracking
alignment, blocking channels etc must be extracted two another class, that will
be shared by those two {{BarrierBufffer}} instances.
2. I'm planning to deduplicate the code by a bit and completely remove
{{BarrierTracker}} class and unify it with {{BarrierBuffer}}. You can even see
this problem in your PR, where you are implementing handling of {{private final
NavigableSet<Long> abortedCheckpointIds;}} twice in those two classes.
> Checkpoint cancellation is not propagated to stop checkpointing threads on
> the task manager
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-8871
> URL: https://issues.apache.org/jira/browse/FLINK-8871
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
> Reporter: Stefan Richter
> Assignee: Yun Tang
> Priority: Critical
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Flink currently lacks any form of feedback mechanism from the job manager /
> checkpoint coordinator to the tasks when it comes to failing a checkpoint.
> This means that running snapshots on the tasks are also not stopped even if
> their owning checkpoint is already cancelled. Two examples for cases where
> this applies are checkpoint timeouts and local checkpoint failures on a task
> together with a configuration that does not fail tasks on checkpoint failure.
> Notice that those running snapshots do no longer account for the maximum
> number of parallel checkpoints, because their owning checkpoint is considered
> as cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation
> where the next checkpoints already started, while the abandoned checkpoint
> thread from a previous checkpoint is still lingering around running. This
> scenario can potentially cascade: many parallel checkpoints will slow down
> checkpointing and make timeouts even more likely.
>
> A possible solution is introducing a {{cancelCheckpoint}} method as
> counterpart to the {{triggerCheckpoint}} method in the task manager gateway,
> which is invoked by the checkpoint coordinator as part of cancelling the
> checkpoint.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)