Thanks to Verne Deng for the valuable question. This is a legitimate operational concern, but it is an inherent trade-off of any runtime tuning capability, not a flaw specific to this design:
1. *The status quo is worse.* Today, the only option is to restart the
entire job, which causes data reprocessing and downstream impact.
Extending the timeout is strictly less disruptive than a restart, even
if the checkpoint ultimately fails.
2. *Observability is preserved.* The existing checkpoint metrics
(checkpointDuration, checkpointSize, per-subtask completion times)
remain fully available. The FLIP does not suppress any signals; it only
gives operators more time.
3. *Guardrails can be added incrementally.* Future iterations can
introduce maximum timeout bounds or automatic alerting when dynamic
overrides are active. This FLIP explicitly scopes Phase 1 to the
mechanism; policy is a separate concern.

The documentation and release notes should include best-practice
guidance: use dynamic timeout extension as a *temporary bridge*, not a
permanent workaround.

On Wed, Apr 15, 2026 at 15:20, Jiangang Liu <[email protected]> wrote:

> Thanks, xiongraorao. This is a valid theoretical concern, but in
> practice the risk is mitigated by the existing design:
>
> 1. The PATCH handler forwards updates to CheckpointCoordinator on the
> *JobMaster main thread*. Both fields are written in a single method
> invocation within synchronized(lock), so any reader that also holds
> the lock sees both updates atomically.
> 2. The volatile keyword is primarily a safety net for unsynchronized
> reads (e.g., metrics reporting or logging). The critical scheduling
> and canceller logic all operates within synchronized(lock).
> 3. Even in the worst case of a transient inconsistent read, the next
> periodic trigger cycle (seconds later) will observe both correct
> values. There is no persistent corruption.
>
> On Wed, Apr 15, 2026 at 15:14, Verne Deng <[email protected]> wrote:
>
>> Thanks for the design. I think it is a common requirement in our
>> production. I have one question: could dynamic timeout extension mask
>> genuine checkpoint hangs, making problems harder to detect?
>>
>> The primary motivation is allowing SREs to extend `checkpointTimeout`
>> to save near-complete checkpoints. However, this introduces an
>> operational anti-pattern risk: operators might habitually extend
>> timeouts instead of investigating root causes (e.g., state backend
>> degradation, skewed key distribution, slow sinks). A checkpoint
>> "stuck" at 95% might actually indicate a genuine hang in one subtask,
>> and extending the timeout only delays the inevitable failure while
>> consuming additional resources (holding barriers, buffering data).
>>
>> This could turn a clear, fast-failing signal into a slow, ambiguous
>> one, exactly the opposite of what good observability requires.
>>
>> On Wed, Apr 15, 2026 at 15:03, 熊饶饶 <[email protected]> wrote:
>>
>> > It is a very useful feature in production. Once a checkpoint fails,
>> > the job may get stuck and the next checkpoint may fail without
>> > updating the config. The only thing I care about is thread safety:
>> > will `volatile` fields cause consistency issues between
>> > `checkpointInterval` and `checkpointTimeout`?
>> >
>> > The FLIP proposes changing `checkpointInterval` and
>> > `checkpointTimeout` from `final` to `volatile` in
>> > `CheckpointCoordinator`. While `volatile` guarantees visibility, it
>> > does not guarantee atomicity across multiple fields. If a user
>> > updates both values simultaneously via a single PATCH request,
>> > there is a window where `CheckpointCoordinator` could observe the
>> > new `checkpointInterval` but the old `checkpointTimeout` (or vice
>> > versa). This partial-update visibility could lead to unexpected
>> > behavior: for example, a shorter interval combined with the old
>> > (shorter) timeout, causing checkpoints to be triggered more
>> > frequently and to time out immediately.
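To make the thread-safety discussion above concrete, here is a minimal, self-contained sketch of the pattern described: both fields stay `volatile` for safe unsynchronized reads, while the PATCH path writes them together inside `synchronized(lock)`, so any reader that also holds the lock can never observe a mixed old/new pair. All class, field, and method names here are illustrative assumptions, not the actual CheckpointCoordinator internals.

```java
// Illustrative sketch only; names are hypothetical and do not reflect
// the real CheckpointCoordinator implementation.
public class CheckpointConfigHolder {
    private final Object lock = new Object();

    // volatile: a safety net for unsynchronized reads (metrics, logging).
    private volatile long checkpointInterval;
    private volatile long checkpointTimeout;

    public CheckpointConfigHolder(long intervalMillis, long timeoutMillis) {
        this.checkpointInterval = intervalMillis;
        this.checkpointTimeout = timeoutMillis;
    }

    // Both fields are written in one synchronized block, so a reader
    // that also holds the lock sees both updates atomically.
    public void updateConfiguration(long newIntervalMillis, long newTimeoutMillis) {
        synchronized (lock) {
            this.checkpointInterval = newIntervalMillis;
            this.checkpointTimeout = newTimeoutMillis;
        }
    }

    // Consistent snapshot for scheduling and canceller logic:
    // {interval, timeout}, read under the same lock.
    public long[] snapshot() {
        synchronized (lock) {
            return new long[] {checkpointInterval, checkpointTimeout};
        }
    }
}
```

An unsynchronized read of a single field remains valid on its own; only code that needs a *consistent pair* must take the lock, which is exactly the split described above.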
>> >
>> > > On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
>> > >
>> > > Hi everyone,
>> > >
>> > > I would like to start a discussion on FLIP-571: Support Dynamically
>> > > Updating Checkpoint Configuration at Runtime via REST API [1].
>> > >
>> > > Currently, checkpoint configuration (checkpointInterval,
>> > > checkpointTimeout) is immutable after job submission. This creates
>> > > significant operational challenges for long-running streaming jobs:
>> > >
>> > > 1. Cascading checkpoint failures cannot be resolved without
>> > > restarting the job, causing data reprocessing delays.
>> > > 2. Near-complete checkpoints (e.g., 95% persisted) are entirely
>> > > discarded on timeout, wasting all I/O work and potentially
>> > > creating a failure loop for large-state jobs.
>> > > 3. Static configuration cannot adapt to variable workloads at
>> > > runtime.
>> > >
>> > > FLIP-571 proposes a new REST API endpoint:
>> > >
>> > > PATCH /jobs/:jobid/checkpoints/configuration
>> > >
>> > > Key design points:
>> > >
>> > > - Timeout changes apply immediately to in-flight checkpoints by
>> > > rescheduling their canceller timers, saving near-complete
>> > > checkpoints from being discarded.
>> > > - Interval changes take effect on the next checkpoint trigger cycle.
>> > > - Configuration overrides are persisted to ExecutionPlanStore
>> > > (following the JobResourceRequirements pattern) and automatically
>> > > restored after failover.
>> > >
>> > > For more details, please refer to the FLIP [1].
>> > >
>> > > Looking forward to your feedback and suggestions!
>> > >
>> > > [1]
>> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
>> > >
>> > > Best regards,
>> > > Jiangang Liu
>> >
>>
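The "rescheduling their canceller timers" point from the FLIP announcement quoted above can be illustrated with a small sketch: when an operator extends the timeout, the pending cancellation task is cancelled and a new one is scheduled for the time remaining relative to the checkpoint's original start. This is a hedged sketch using a plain ScheduledExecutorService; all names are hypothetical and the real coordinator uses its own timer service and lock.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch; class and method names are hypothetical.
public class PendingCheckpointSketch {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();
    private final long startMillis = System.currentTimeMillis();
    private final AtomicBoolean timedOut = new AtomicBoolean(false);
    private ScheduledFuture<?> canceller;

    // Called once when the checkpoint starts, and again whenever the
    // timeout is updated at runtime.
    public synchronized void scheduleCanceller(long timeoutMillis) {
        if (canceller != null) {
            canceller.cancel(false); // drop the old timer without firing it
        }
        // Deadline is measured from the checkpoint's original start time,
        // so extending the timeout simply pushes the deadline out.
        long elapsed = System.currentTimeMillis() - startMillis;
        long remaining = Math.max(0, timeoutMillis - elapsed);
        canceller = timer.schedule(
                () -> timedOut.set(true), remaining, TimeUnit.MILLISECONDS);
    }

    public boolean isTimedOut() {
        return timedOut.get();
    }

    public void shutdown() {
        timer.shutdownNow();
    }
}
```

Keying the new timer off the original start time (rather than "now") keeps the semantics predictable: the new timeout value means the same thing whether it was set at submission or mid-checkpoint.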

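For readers wanting to try the endpoint from the announcement, a sketch of building the PATCH request with java.net.http follows. The JSON field names in the body are assumptions for illustration only; the thread does not specify the request schema, so consult the FLIP for the actual one.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Builds (but does not send) a PATCH request against the endpoint
// proposed in FLIP-571. The JSON field names below are assumed, not
// taken from the FLIP's actual request schema.
public class PatchRequestSketch {
    public static HttpRequest build(String baseUrl, String jobId,
                                    long intervalMillis, long timeoutMillis) {
        String body = String.format(
                "{\"checkpointInterval\": %d, \"checkpointTimeout\": %d}",
                intervalMillis, timeoutMillis);
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/jobs/" + jobId
                        + "/checkpoints/configuration"))
                .header("Content-Type", "application/json")
                // HttpRequest has no patch() convenience method, so the
                // generic method(...) builder is used.
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build();
    }
}
```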