Hi everyone,

I would like to start a discussion on FLIP-571: Support Dynamically Updating Checkpoint Configuration at Runtime via REST API [1].

Currently, checkpoint configuration (checkpointInterval, checkpointTimeout) is immutable after job submission. This creates significant operational challenges for long-running streaming jobs:

1. Cascading checkpoint failures cannot be resolved without restarting the job, causing data reprocessing delays.
2. Near-complete checkpoints (e.g., 95% persisted) are entirely discarded on timeout, wasting all I/O work and potentially creating a failure loop for large-state jobs.
3. Static configuration cannot adapt to variable workloads at runtime.

FLIP-571 proposes a new REST API endpoint:

    PATCH /jobs/:jobid/checkpoints/configuration

Key design points:

- Timeout changes apply immediately to in-flight checkpoints by rescheduling their canceller timers, saving near-complete checkpoints from being discarded.
- Interval changes take effect on the next checkpoint trigger cycle.
- Configuration overrides are persisted to ExecutionPlanStore (following the JobResourceRequirements pattern) and automatically restored after failover.

For more details, please refer to the FLIP [1].

Looking forward to your feedback and suggestions!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API

Best regards,
Jiangang Liu
