Thanks, Look_Y_Y. These are reasonable architectural concerns, but the design deliberately accepts these trade-offs for good reasons:
1. *Proven pattern.* JobResourceRequirements already uses exactly this
approach (`$internal.job-resource-requirements` inside ExecutionPlan). The
pattern has been battle-tested in production across the adaptive scheduler
and dynamic rescaling. Introducing a separate storage path would add
operational complexity (new ZK/K8s paths, new cleanup logic, new recovery
code paths) with minimal benefit.
2. *Write amplification is bounded.* Dynamic checkpoint updates are rare
operations (SRE-initiated, not automated). A few extra writes per hour to
ZooKeeper are negligible compared to the checkpoint metadata writes that
already happen every interval. If future parameters make updates more
frequent, a dedicated lightweight store can be introduced.
3. *Debuggability.* The GET /jobs/:jobid/checkpoints/config endpoint
explicitly surfaces the effective values. Additionally, the `$internal.`
prefix convention clearly marks the field as a runtime override,
distinguishing it from user-submitted configuration.

On Wed, Apr 15, 2026 at 6:13 PM Look_Y_Y <[email protected]> wrote:

> Thanks for the FLIP. I like the dynamic way to control Flink. But I am
> confused about why it reuses `ExecutionPlan.getJobConfiguration()`
> instead of a dedicated storage path. What about the storage bloat risk?
>
> The FLIP proposes persisting `JobCheckpointingOverrides` inside
> `ExecutionPlan.getJobConfiguration()` using the key
> `$internal.job-checkpoint-overrides`. This piggybacks on the existing
> `ExecutionPlan` blob in ZooKeeper/Kubernetes ConfigMap. Two concerns
> arise:
>
> 1. Coupling risk. Embedding runtime overrides inside the ExecutionPlan
> blurs the boundary between job definition (immutable after submission)
> and runtime state (mutable). This could cause confusion when debugging —
> the ExecutionPlan retrieved from the store may differ from the originally
> submitted plan.
> 2. Size and write amplification. Every dynamic update triggers a full
> `ExecutionPlan` re-serialization and write. For jobs with large execution
> plans (thousands of operators), this is a heavyweight operation for
> changing two numbers.
>
> > On Apr 15, 2026, at 15:32, Jiangang Liu <[email protected]> wrote:
> >
> > Thanks for Verne Deng's valuable question. This is a legitimate
> > operational concern, but it is an inherent trade-off of any runtime
> > tuning capability, not a flaw specific to this design:
> >
> > 1. *The status quo is worse.* Today, the only option is to restart the
> > entire job, which causes data reprocessing and downstream impact.
> > Extending the timeout is strictly less disruptive than a restart, even
> > if the checkpoint ultimately fails.
> > 2. *Observability is preserved.* The existing checkpoint metrics
> > (checkpointDuration, checkpointSize, per-subtask completion times)
> > remain fully available. The FLIP does not suppress any signals — it
> > only gives operators more time.
> > 3. *Guardrails can be added incrementally.* Future iterations can
> > introduce maximum timeout bounds or automatic alerting when dynamic
> > overrides are active. This FLIP explicitly scopes Phase 1 to the
> > mechanism; policy is a separate concern.
> >
> > The documentation and release notes should include best-practice
> > guidance: use dynamic timeout extension as a *temporary bridge*, not a
> > permanent workaround.
> >
> > On Wed, Apr 15, 2026 at 3:20 PM Jiangang Liu <[email protected]>
> > wrote:
> >
> >> Thanks, xiongraorao. This is a valid theoretical concern, but in
> >> practice the risk is mitigated by the existing design:
> >>
> >> 1. The PATCH handler forwards updates to CheckpointCoordinator on the
> >> *JobMaster main thread*. Both fields are written in a single method
> >> invocation within synchronized(lock), so any reader that also holds
> >> the lock sees both updates atomically.
> >> 2. The volatile keyword is primarily a safety net for unsynchronized
> >> reads (e.g., metrics reporting or logging). The critical scheduling
> >> and canceller logic all runs within synchronized(lock).
> >> 3. Even in the worst case of a transient inconsistent read, the next
> >> periodic trigger cycle (seconds later) will observe both correct
> >> values. There is no persistent corruption.
> >>
> >> On Wed, Apr 15, 2026 at 3:14 PM Verne Deng <[email protected]>
> >> wrote:
> >>
> >>> Thanks for the design. I think it is a common requirement in our
> >>> production. I have one question: could dynamic timeout extension mask
> >>> genuine checkpoint hangs, making problems harder to detect?
> >>>
> >>> The primary motivation is allowing SREs to extend `checkpointTimeout`
> >>> to save near-complete checkpoints. However, this introduces an
> >>> operational anti-pattern risk: operators might habitually extend
> >>> timeouts instead of investigating root causes (e.g., state backend
> >>> degradation, skewed key distribution, slow sinks). A checkpoint
> >>> "stuck" at 95% might actually indicate a genuine hang in one subtask,
> >>> and extending the timeout only delays the inevitable failure while
> >>> consuming additional resources (holding barriers, buffering data).
> >>>
> >>> This could turn a clear, fast-failing signal into a slow, ambiguous
> >>> one — exactly the opposite of what good observability requires.
> >>>
> >>> On Wed, Apr 15, 2026 at 3:03 PM 熊饶饶 <[email protected]> wrote:
> >>>>
> >>>> It is a very useful feature in production. Once a checkpoint fails,
> >>>> the job may get stuck and the next checkpoint may fail without
> >>>> updating the config. The only thing I care about is thread safety:
> >>>> will `volatile` fields cause consistency issues between
> >>>> `checkpointInterval` and `checkpointTimeout`?
> >>>>
> >>>> The FLIP proposes changing `checkpointInterval` and
> >>>> `checkpointTimeout` from `final` to `volatile` in
> >>>> `CheckpointCoordinator`. While `volatile` guarantees visibility, it
> >>>> does not guarantee atomicity across multiple fields. If a user
> >>>> updates both values simultaneously via a single PATCH request, there
> >>>> is a window where `CheckpointCoordinator` could observe the new
> >>>> `checkpointInterval` but the old `checkpointTimeout` (or vice
> >>>> versa). This partial-update visibility could lead to unexpected
> >>>> behavior — for example, a shorter interval combined with the old
> >>>> (shorter) timeout, causing checkpoints to be triggered more
> >>>> frequently and immediately time out.
> >>>>
> >>>>> On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> I would like to start a discussion on FLIP-571: Support Dynamically
> >>>>> Updating Checkpoint Configuration at Runtime via REST API [1].
> >>>>>
> >>>>> Currently, checkpoint configuration (checkpointInterval,
> >>>>> checkpointTimeout) is immutable after job submission. This creates
> >>>>> significant operational challenges for long-running streaming jobs:
> >>>>>
> >>>>> 1. Cascading checkpoint failures cannot be resolved without
> >>>>> restarting the job, causing data reprocessing delays.
> >>>>> 2. Near-complete checkpoints (e.g., 95% persisted) are entirely
> >>>>> discarded on timeout — wasting all I/O work and potentially
> >>>>> creating a failure loop for large-state jobs.
> >>>>> 3. Static configuration cannot adapt to variable workloads at
> >>>>> runtime.
> >>>>>
> >>>>> FLIP-571 proposes a new REST API endpoint:
> >>>>>
> >>>>> PATCH /jobs/:jobid/checkpoints/configuration
> >>>>>
> >>>>> Key design points:
> >>>>>
> >>>>> - Timeout changes apply immediately to in-flight checkpoints by
> >>>>> rescheduling their canceller timers, saving near-complete
> >>>>> checkpoints from being discarded.
> >>>>> - Interval changes take effect on the next checkpoint trigger
> >>>>> cycle.
> >>>>> - Configuration overrides are persisted to ExecutionPlanStore
> >>>>> (following the JobResourceRequirements pattern) and automatically
> >>>>> restored after failover.
> >>>>>
> >>>>> For more details, please refer to the FLIP [1].
> >>>>>
> >>>>> Looking forward to your feedback and suggestions!
> >>>>>
> >>>>> [1]
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> >>>>>
> >>>>> Best regards,
> >>>>> Jiangang Liu
> >>>>
> >>>
> >>
> >
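P.S. For anyone following the thread-safety subthread above: the locking
discipline being described can be sketched roughly as below. This is a
minimal illustration with made-up class and method names, not Flink's
actual CheckpointCoordinator code. The point is that both volatile fields
are written inside a single synchronized block, so any reader taking the
same lock observes the (interval, timeout) pair atomically, while
unsynchronized readers (metrics, logging) still get visibility of each
field individually.

```java
/**
 * Minimal sketch of the volatile-plus-lock discipline discussed in the
 * thread. Hypothetical names; Flink's real CheckpointCoordinator differs.
 */
public class CheckpointSettingsSketch {

    private final Object lock = new Object();

    // volatile: unsynchronized reads (metrics, logging) always see a
    // recently written value of each individual field
    private volatile long checkpointInterval;
    private volatile long checkpointTimeout;

    public CheckpointSettingsSketch(long intervalMs, long timeoutMs) {
        this.checkpointInterval = intervalMs;
        this.checkpointTimeout = timeoutMs;
    }

    /** PATCH-style update: both fields change under one lock. */
    public void updateSettings(long newIntervalMs, long newTimeoutMs) {
        synchronized (lock) {
            this.checkpointInterval = newIntervalMs;
            this.checkpointTimeout = newTimeoutMs;
        }
    }

    /**
     * Scheduling/canceller logic reads both values under the same lock,
     * so it can never observe a torn (interval, timeout) pair.
     */
    public long[] readSettingsConsistently() {
        synchronized (lock) {
            return new long[] {checkpointInterval, checkpointTimeout};
        }
    }

    /**
     * Lock-free read, e.g. for a metrics gauge; it may race with an
     * in-progress update but never sees a stale cached value.
     */
    public long timeoutForMetrics() {
        return checkpointTimeout;
    }
}
```

Under this sketch, the proposed REST handler would reduce to a single
updateSettings(...) call forwarded to the JobMaster main thread.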
