Yunfeng Zhou created FLINK-34519:
------------------------------------

             Summary: Refine checkpoint scheduling and canceling logic
                 Key: FLINK-34519
                 URL: https://issues.apache.org/jira/browse/FLINK-34519
             Project: Flink
          Issue Type: Technical Debt
          Components: Runtime / Checkpointing
    Affects Versions: 1.20.0
            Reporter: Yunfeng Zhou

In the current implementation, CheckpointCoordinator#startCheckpointScheduler stops the checkpoint scheduler before starting it, and CheckpointCoordinator#stopCheckpointScheduler cancels all ongoing and pending checkpoints. When a stop-with-savepoint request is received, the checkpoint coordinator calls stopCheckpointScheduler before creating the savepoint, and restarts the scheduler afterwards if the savepoint fails.

The problem with this behavior is that it mixes up different checkpointing types. For example, stopCheckpointScheduler() only needs to cancel previously triggered periodic checkpoints, but the current behavior cancels ongoing savepoints as well. This is still acceptable for now, given that periodic checkpointing is enabled as long as a job is running, and two users would rarely trigger savepoints at the same time. However, the Batch-Streaming Unification optimizations need to change some of these assumptions, so the checkpoint coordinator should fix this problem.

To be exact, the checkpoint coordinator should at least distinguish between the following semantics:
- Periodic checkpointing is enabled, so that failover recovery time is kept within a time limit.
- Periodic checkpointing is disabled to reduce the corresponding performance overhead, but the ability to checkpoint still exists and users can trigger a savepoint at any time.
- Checkpoints and savepoints are not allowed due to job status or topological requirements.

A Flink job should also be able to switch between the checkpointing semantics mentioned above dynamically at runtime.

Besides, checkpoints canceled in stopCheckpointScheduler() fail with the error message "Checkpoint Coordinator is suspending", which is ambiguous for debugging. The detailed reason should be recorded as well.
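
To make the three semantics above more concrete, here is a minimal, self-contained Java sketch. It is not Flink code: the class CheckpointModeSketch, the enums Mode and Kind, and the methods trigger and switchMode are all hypothetical names invented for illustration, and the real CheckpointCoordinator is considerably more involved. The sketch only shows the idea that switching modes cancels just the pending work the new mode forbids, and records a specific reason for each cancellation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration only; none of these names exist in Flink. */
class CheckpointModeSketch {

    /** The three semantics the issue asks the coordinator to distinguish. */
    enum Mode {
        PERIODIC,       // periodic checkpoints run; savepoints are allowed
        ON_DEMAND_ONLY, // no periodic checkpoints; savepoints are still allowed
        DISALLOWED      // neither checkpoints nor savepoints are allowed
    }

    enum Kind { PERIODIC_CHECKPOINT, SAVEPOINT }

    private Mode mode = Mode.PERIODIC;
    private long nextId = 1;
    private final Map<Long, Kind> pending = new LinkedHashMap<>();
    private final List<String> cancellationLog = new ArrayList<>();

    /** Registers a pending checkpoint or savepoint if the current mode permits it. */
    long trigger(Kind kind) {
        if (mode == Mode.DISALLOWED) {
            throw new IllegalStateException("Checkpointing is not allowed in mode " + mode);
        }
        if (kind == Kind.PERIODIC_CHECKPOINT && mode != Mode.PERIODIC) {
            throw new IllegalStateException("Periodic checkpointing is disabled");
        }
        long id = nextId++;
        pending.put(id, kind);
        return id;
    }

    /** Switches mode, canceling only what the new mode forbids and logging why. */
    void switchMode(Mode newMode, String reason) {
        pending.entrySet().removeIf(entry -> {
            boolean cancel =
                    newMode == Mode.DISALLOWED
                            || (newMode == Mode.ON_DEMAND_ONLY
                                    && entry.getValue() == Kind.PERIODIC_CHECKPOINT);
            if (cancel) {
                cancellationLog.add(
                        entry.getValue() + " " + entry.getKey() + " canceled: " + reason);
            }
            return cancel;
        });
        mode = newMode;
    }

    Map<Long, Kind> pending() {
        return pending;
    }

    List<String> cancellationLog() {
        return cancellationLog;
    }

    public static void main(String[] args) {
        CheckpointModeSketch c = new CheckpointModeSketch();
        c.trigger(Kind.PERIODIC_CHECKPOINT);
        c.trigger(Kind.SAVEPOINT);
        // Disabling periodic checkpointing should leave the savepoint untouched.
        c.switchMode(Mode.ON_DEMAND_ONLY, "periodic checkpointing disabled by user");
        System.out.println("still pending: " + c.pending());
        System.out.println("canceled: " + c.cancellationLog());
    }
}
```

In this sketch, moving from PERIODIC to ON_DEMAND_ONLY drops the pending periodic checkpoint but keeps the pending savepoint, and each cancellation carries the caller-supplied reason rather than a generic "suspending" message.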