Thanks for the design. It addresses a common requirement in our production environments. I have one question: could dynamic timeout extension mask genuine checkpoint hangs, making problems harder to detect?
The primary motivation is allowing SREs to extend `checkpointTimeout` to save near-complete checkpoints. However, this introduces an operational anti-pattern risk: operators might habitually extend timeouts instead of investigating root causes (e.g., state backend degradation, skewed key distribution, slow sinks). A checkpoint "stuck" at 95% might actually indicate a genuine hang in one subtask, and extending the timeout only delays the inevitable failure while consuming additional resources (holding barriers, buffering data). This could turn a clear, fast-failing signal into a slow, ambiguous one — exactly the opposite of what good observability requires.

On Wed, Apr 15, 2026 at 15:03, 熊饶饶 <[email protected]> wrote:

> It is a very useful feature in production. Once a checkpoint fails, the
> job may get stuck, and the next checkpoint may fail without updating the
> config. The only thing I am concerned about is thread safety: will
> `volatile` fields cause consistency issues between `checkpointInterval`
> and `checkpointTimeout`?
>
> The FLIP proposes changing `checkpointInterval` and `checkpointTimeout` from
> `final` to `volatile` in `CheckpointCoordinator`. While `volatile` guarantees
> visibility, it does not guarantee atomicity across multiple fields. If a user
> updates both values simultaneously via a single PATCH request, there is a
> window where `CheckpointCoordinator` could observe the new
> `checkpointInterval` but the old `checkpointTimeout` (or vice versa). This
> partial-update visibility could lead to unexpected behavior — for example, a
> shorter interval combined with the old (shorter) timeout, causing checkpoints
> to be triggered more frequently and immediately time out.
>
> On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > I would like to start a discussion on FLIP-571: Support Dynamically
> > Updating Checkpoint Configuration at Runtime via REST API [1].
> >
> > Currently, checkpoint configuration (checkpointInterval, checkpointTimeout)
> > is immutable after job submission. This creates significant operational
> > challenges for long-running streaming jobs:
> >
> > 1. Cascading checkpoint failures cannot be resolved without restarting the
> >    job, causing data reprocessing delays.
> > 2. Near-complete checkpoints (e.g., 95% persisted) are entirely discarded
> >    on timeout — wasting all I/O work and potentially creating a failure
> >    loop for large-state jobs.
> > 3. Static configuration cannot adapt to variable workloads at runtime.
> >
> > FLIP-571 proposes a new REST API endpoint:
> >
> >     PATCH /jobs/:jobid/checkpoints/configuration
> >
> > Key design points:
> >
> > - Timeout changes apply immediately to in-flight checkpoints by
> >   rescheduling their canceller timers, saving near-complete checkpoints
> >   from being discarded.
> > - Interval changes take effect on the next checkpoint trigger cycle.
> > - Configuration overrides are persisted to ExecutionPlanStore (following
> >   the JobResourceRequirements pattern) and automatically restored after
> >   failover.
> >
> > For more details, please refer to the FLIP [1].
> >
> > Looking forward to your feedback and suggestions!
> >
> > [1]
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> >
> > Best regards,
> > Jiangang Liu
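On the torn-read concern raised in the quoted reply: one common way to avoid partial-update visibility across two related settings is to publish both in a single immutable snapshot behind one reference, so a PATCH swaps them atomically and readers always see a consistent pair. A minimal sketch of that idea (hypothetical class names, not Flink's actual `CheckpointCoordinator` internals):

```java
import java.util.concurrent.atomic.AtomicReference;

// Immutable pair of checkpoint settings; readers can never observe a
// mix of old and new values because both live in one object.
final class CheckpointSettings {
    final long intervalMillis;
    final long timeoutMillis;

    CheckpointSettings(long intervalMillis, long timeoutMillis) {
        this.intervalMillis = intervalMillis;
        this.timeoutMillis = timeoutMillis;
    }
}

final class CheckpointSettingsHolder {
    // One reference, swapped atomically, replaces two volatile fields.
    private final AtomicReference<CheckpointSettings> current =
            new AtomicReference<>(new CheckpointSettings(60_000L, 600_000L));

    // A PATCH handler would call this once with both values.
    void update(long intervalMillis, long timeoutMillis) {
        current.set(new CheckpointSettings(intervalMillis, timeoutMillis));
    }

    // The coordinator reads one consistent snapshot per trigger cycle.
    CheckpointSettings snapshot() {
        return current.get();
    }
}
```

With this pattern, the "new interval plus old timeout" window disappears, at the cost of one small allocation per update, which is negligible at configuration-change frequency.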
