Stefan Richter created FLINK-8871:
-------------------------------------
Summary: Checkpoint cancellation is not propagated to stop
checkpointing threads on the task manager
Key: FLINK-8871
URL: https://issues.apache.org/jira/browse/FLINK-8871
Project: Flink
Issue Type: Bug
Components: State Backends, Checkpointing
Affects Versions: 1.4.1, 1.3.2, 1.5.0
Reporter: Stefan Richter
Fix For: 1.6.0
Flink currently lacks any form of feedback mechanism from the job manager /
checkpoint coordinator to the tasks when it comes to failing a checkpoint. This
means that running snapshots on the tasks are also not stopped even if their
owning checkpoint is already cancelled. Two cases where this applies are
checkpoint timeouts and local checkpoint failures on a task when the job is
configured not to fail tasks on checkpoint failure. Note that these running
snapshots no longer count toward the maximum number of concurrent checkpoints,
because their owning checkpoint is considered cancelled.
Not stopping the task's snapshot thread can lead to a problematic situation
in which the next checkpoint has already started while the abandoned snapshot
thread of a previous checkpoint is still running. This scenario can cascade:
many snapshots running in parallel slow down checkpointing and make timeouts
even more likely.
A possible solution is to introduce a {{cancelCheckpoint}} method as a
counterpart to the {{triggerCheckpoint}} method in the task manager gateway.
The checkpoint coordinator would invoke it as part of cancelling the
checkpoint.
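To illustrate the idea, here is a minimal, self-contained sketch of how a task
could track its running snapshots by checkpoint ID and interrupt one on
request. It is not Flink code: the names {{CancelCheckpointSketch}},
{{TaskGateway}}, {{Task}} and {{runningSnapshotCount}} are hypothetical, the
method signatures are simplified assumptions, and the snapshot work is
simulated with a sleep.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch only; none of these types are Flink's actual classes.
final class CancelCheckpointSketch {

    /** Assumed gateway shape: cancelCheckpoint mirrors triggerCheckpoint. */
    interface TaskGateway {
        void triggerCheckpoint(long checkpointId);

        /** Returns true if a running snapshot was found and cancelled. */
        boolean cancelCheckpoint(long checkpointId);
    }

    static final class Task implements TaskGateway {
        private final ExecutorService snapshotPool = Executors.newCachedThreadPool();
        // Running snapshot futures, keyed by the ID of their owning checkpoint.
        private final Map<Long, Future<?>> runningSnapshots = new ConcurrentHashMap<>();

        @Override
        public void triggerCheckpoint(long checkpointId) {
            Future<?> snapshot = snapshotPool.submit(() -> {
                try {
                    Thread.sleep(60_000); // stand-in for a slow state snapshot
                } catch (InterruptedException e) {
                    // Cancelled: restore the interrupt flag and stop working.
                    Thread.currentThread().interrupt();
                }
            });
            runningSnapshots.put(checkpointId, snapshot);
        }

        @Override
        public boolean cancelCheckpoint(long checkpointId) {
            Future<?> snapshot = runningSnapshots.remove(checkpointId);
            // cancel(true) interrupts the snapshot thread if it is still running.
            return snapshot != null && snapshot.cancel(true);
        }

        /** Number of snapshots still tracked and not yet finished. */
        int runningSnapshotCount() {
            runningSnapshots.values().removeIf(Future::isDone);
            return runningSnapshots.size();
        }

        void shutdown() {
            snapshotPool.shutdownNow();
        }
    }
}
```

In this sketch the coordinator side would call {{cancelCheckpoint(id)}} on every
task that received {{triggerCheckpoint(id)}} when it times out or declines the
checkpoint. A real implementation would also have to release partially written
state and handle a cancel message that arrives before the trigger.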
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)