Thanks, xiongraorao. This is a valid theoretical concern, but in practice
the risk is mitigated by the existing design:

   1. The PATCH handler forwards updates to CheckpointCoordinator on the
   *JobMaster main thread*. Both fields are written in a single method
   invocation within synchronized(lock), so any reader that also holds the
   lock sees both updates atomically.
   2. The volatile keyword is primarily a safety net for unsynchronized
   reads (e.g., metrics reporting or logging). The critical scheduling and
   canceller logic all operates within synchronized(lock).
   3. Even in the worst case of a transient inconsistent read, the next
   periodic trigger cycle (seconds later) will observe both correct values.
   There is no persistent corruption.
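To make points 1 and 2 concrete, here is a minimal sketch of the locking discipline described above. The class and method names are illustrative only, not the actual CheckpointCoordinator code:

```java
// Sketch of the pattern: both fields are volatile as a safety net for
// lock-free readers, but every compound read/write goes through one lock,
// so lock-holding readers can never see a torn interval/timeout pair.
public class CoordinatorConfigSketch {
    private final Object lock = new Object();

    // volatile: visibility for unsynchronized readers (metrics, logging)
    private volatile long checkpointInterval = 60_000L;
    private volatile long checkpointTimeout = 600_000L;

    /** Invoked by the PATCH handler on the JobMaster main thread. */
    public void updateConfiguration(long newInterval, long newTimeout) {
        synchronized (lock) {
            // Single critical section: a reader holding the same lock
            // observes either both old values or both new values.
            this.checkpointInterval = newInterval;
            this.checkpointTimeout = newTimeout;
        }
    }

    /** Scheduling/canceller logic reads both values under the lock. */
    public long[] readConsistentSnapshot() {
        synchronized (lock) {
            return new long[] {checkpointInterval, checkpointTimeout};
        }
    }

    /** Lock-free read, e.g. for metrics; may briefly lag, never blocks. */
    public long currentTimeoutForMetrics() {
        return checkpointTimeout;
    }
}
```

A lock-free reader can still observe a transient mismatch, which is exactly the bounded staleness point 3 accepts: the next trigger cycle reads a consistent pair under the lock.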


Verne Deng <[email protected]> wrote on Wed, Apr 15, 2026 at 15:14:

> Thanks for the design. I think it is a common requirement in our
> production. I have one question: could dynamic timeout extension mask
> genuine checkpoint hangs, making problems harder to detect?
>
> The primary motivation is allowing SREs to extend `checkpointTimeout`
> to save near-complete checkpoints. However, this introduces an
> operational anti-pattern risk: operators might habitually extend
> timeouts instead of investigating root causes (e.g., state backend
> degradation, skewed key distribution, slow sinks). A checkpoint
> "stuck" at 95% might actually indicate a genuine hang in one subtask,
> and extending the timeout only delays the inevitable failure while
> consuming additional resources (holding barriers, buffering data).
>
> This could turn a clear, fast-failing signal into a slow, ambiguous
> one — exactly the opposite of what good observability requires.
>
> 熊饶饶 <[email protected]> wrote on Wed, Apr 15, 2026 at 15:03:
> >
> >
> > It is a very useful feature in production. Once a checkpoint fails, the
> > job may get stuck, and the next checkpoint may fail as well unless the
> > config is updated. My only concern is thread safety: will `volatile`
> > fields cause consistency issues between `checkpointInterval` and
> > `checkpointTimeout`?
> >
> > The FLIP proposes changing `checkpointInterval` and `checkpointTimeout`
> > from `final` to `volatile` in `CheckpointCoordinator`. While `volatile`
> > guarantees visibility, it does not guarantee atomicity across multiple
> > fields. If a user updates both values simultaneously via a single PATCH
> > request, there is a window where `CheckpointCoordinator` could observe
> > the new `checkpointInterval` but the old `checkpointTimeout` (or vice
> > versa). This partial-update visibility could lead to unexpected
> > behavior, for example a shorter interval combined with the old
> > (shorter) timeout, causing checkpoints to be triggered more frequently
> > and to time out immediately.
> >
> > > On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
> > >
> > > Hi everyone,
> > >
> > > I would like to start a discussion on FLIP-571: Support Dynamically
> > > Updating Checkpoint Configuration at Runtime via REST API [1].
> > >
> > > Currently, checkpoint configuration (checkpointInterval,
> > > checkpointTimeout) is immutable after job submission. This creates
> > > significant operational challenges for long-running streaming jobs:
> > >
> > >   1. Cascading checkpoint failures cannot be resolved without
> > >   restarting the job, causing data reprocessing delays.
> > >   2. Near-complete checkpoints (e.g., 95% persisted) are entirely
> > >   discarded on timeout, wasting all I/O work and potentially creating
> > >   a failure loop for large-state jobs.
> > >   3. Static configuration cannot adapt to variable workloads at
> > >   runtime.
> > >
> > > FLIP-571 proposes a new REST API endpoint:
> > >
> > > PATCH /jobs/:jobid/checkpoints/configuration
> > >
> > > Key design points:
> > >
> > >   - Timeout changes apply immediately to in-flight checkpoints by
> > >   rescheduling their canceller timers, saving near-complete
> > >   checkpoints from being discarded.
> > >   - Interval changes take effect on the next checkpoint trigger cycle.
> > >   - Configuration overrides are persisted to ExecutionPlanStore
> > >   (following the JobResourceRequirements pattern) and automatically
> > >   restored after failover.
> > >
> > > For more details, please refer to the FLIP [1].
> > >
> > > Looking forward to your feedback and suggestions!
> > >
> > > [1]
> > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> > >
> > > Best regards,
> > > Jiangang Liu
> >
>