Thanks, Look_Y_Y. These are reasonable architectural concerns, but the design deliberately accepts these trade-offs for good reasons:
1. *Proven pattern.* JobResourceRequirements already uses exactly this
approach (`$internal.job-resource-requirements` inside ExecutionPlan). The
pattern has been battle-tested in production across the adaptive scheduler
and dynamic rescaling. Introducing a separate storage path would add
operational complexity (new ZK/K8s paths, new cleanup logic, new recovery
code paths) with minimal benefit.
2. *Write amplification is bounded.* Dynamic checkpoint updates are rare
operations (SRE-initiated, not automated). A few extra writes per hour to
ZooKeeper are negligible compared to the checkpoint metadata writes that
already happen every interval. If future parameters make updates more
frequent, a dedicated lightweight store can be introduced.
3. *Debuggability.* The GET /jobs/:jobid/checkpoints/config endpoint
explicitly surfaces the effective values. Additionally, the `$internal.`
prefix convention clearly marks the field as a runtime override,
distinguishing it from user-submitted configuration.

On Wed, Apr 15, 2026 at 6:13 PM Look_Y_Y <[email protected]> wrote:

> Thanks for the FLIP. I like the dynamic way to control Flink. But I am
> confused about why it reuses `ExecutionPlan.getJobConfiguration()`
> instead of a dedicated storage path. What about the storage bloat risk?
>
> The FLIP proposes persisting `JobCheckpointingOverrides` inside
> `ExecutionPlan.getJobConfiguration()` using the key
> `$internal.job-checkpoint-overrides`. This piggybacks on the existing
> `ExecutionPlan` blob in ZooKeeper/Kubernetes ConfigMap. Two concerns
> arise:
>
> 1. Coupling risk. Embedding runtime overrides inside the ExecutionPlan
> blurs the boundary between job definition (immutable after submission)
> and runtime state (mutable). This could cause confusion when debugging —
> the ExecutionPlan retrieved from the store may differ from the originally
> submitted plan.
> 2. Size and write amplification. Every dynamic update triggers a full
> `ExecutionPlan` re-serialization and write. For jobs with large execution
> plans (thousands of operators), this is a heavyweight operation for
> changing two numbers.
>
> > On Apr 15, 2026, at 15:32, Jiangang Liu <[email protected]> wrote:
> >
> > Thanks for Verne Deng's valuable question. This is a legitimate
> > operational concern, but it is an inherent trade-off of any runtime
> > tuning capability, not a flaw specific to this design:
> >
> > 1. *The status quo is worse.* Today, the only option is to restart the
> > entire job, which causes data reprocessing and downstream impact.
> > Extending the timeout is strictly less disruptive than a restart, even
> > if the checkpoint ultimately fails.
> > 2. *Observability is preserved.* The existing checkpoint metrics
> > (checkpointDuration, checkpointSize, per-subtask completion times)
> > remain fully available. The FLIP does not suppress any signals — it
> > only gives operators more time.
> > 3. *Guardrails can be added incrementally.* Future iterations can
> > introduce maximum timeout bounds or automatic alerting when dynamic
> > overrides are active. This FLIP explicitly scopes Phase 1 to the
> > mechanism; policy is a separate concern.
> >
> > The documentation and release notes should include best-practice
> > guidance: use dynamic timeout extension as a *temporary bridge*, not a
> > permanent workaround.
> >
> > On Wed, Apr 15, 2026 at 3:20 PM Jiangang Liu <[email protected]>
> > wrote:
> >
> >> Thanks, xiongraorao. This is a valid theoretical concern, but in
> >> practice the risk is mitigated by the existing design:
> >>
> >> 1. The PATCH handler forwards updates to CheckpointCoordinator on the
> >> *JobMaster main thread*. Both fields are written in a single method
> >> invocation within synchronized(lock), so any reader that also holds
> >> the lock sees both updates atomically.
> >> 2. The volatile keyword is primarily a safety net for unsynchronized
> >> reads (e.g., metrics reporting or logging). The critical scheduling
> >> and canceller logic all runs within synchronized(lock).
> >> 3. Even in the worst case of a transient inconsistent read, the next
> >> periodic trigger cycle (seconds later) will observe both correct
> >> values. There is no persistent corruption.
> >>
> >> On Wed, Apr 15, 2026 at 3:14 PM Verne Deng <[email protected]>
> >> wrote:
> >>
> >>> Thanks for the design. I think it is a common requirement in our
> >>> production. I have one question: could dynamic timeout extension mask
> >>> genuine checkpoint hangs, making problems harder to detect?
> >>>
> >>> The primary motivation is allowing SREs to extend `checkpointTimeout`
> >>> to save near-complete checkpoints. However, this introduces an
> >>> operational anti-pattern risk: operators might habitually extend
> >>> timeouts instead of investigating root causes (e.g., state backend
> >>> degradation, skewed key distribution, slow sinks). A checkpoint
> >>> "stuck" at 95% might actually indicate a genuine hang in one subtask,
> >>> and extending the timeout only delays the inevitable failure while
> >>> consuming additional resources (holding barriers, buffering data).
> >>>
> >>> This could turn a clear, fast-failing signal into a slow, ambiguous
> >>> one — exactly the opposite of what good observability requires.
> >>>
> >>> On Wed, Apr 15, 2026 at 3:03 PM 熊饶饶 <[email protected]> wrote:
> >>>>
> >>>> It is a very useful feature in production. Once a checkpoint fails,
> >>>> the job may get stuck and the next checkpoint may fail without
> >>>> updating the config. The only thing I care about is thread safety:
> >>>> will `volatile` fields cause consistency issues between
> >>>> `checkpointInterval` and `checkpointTimeout`?
> >>>>
> >>>> The FLIP proposes changing `checkpointInterval` and
> >>>> `checkpointTimeout` from `final` to `volatile` in
> >>>> `CheckpointCoordinator`. While `volatile` guarantees visibility, it
> >>>> does not guarantee atomicity across multiple fields. If a user
> >>>> updates both values simultaneously via a single PATCH request, there
> >>>> is a window where `CheckpointCoordinator` could observe the new
> >>>> `checkpointInterval` but the old `checkpointTimeout` (or vice
> >>>> versa). This partial-update visibility could lead to unexpected
> >>>> behavior — for example, a shorter interval combined with the old
> >>>> (shorter) timeout, causing checkpoints to be triggered more
> >>>> frequently and immediately time out.
> >>>>
> >>>>> On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>> Hi everyone,
> >>>>>
> >>>>> I would like to start a discussion on FLIP-571: Support Dynamically
> >>>>> Updating Checkpoint Configuration at Runtime via REST API [1].
> >>>>>
> >>>>> Currently, checkpoint configuration (checkpointInterval,
> >>>>> checkpointTimeout) is immutable after job submission. This creates
> >>>>> significant operational challenges for long-running streaming jobs:
> >>>>>
> >>>>> 1. Cascading checkpoint failures cannot be resolved without
> >>>>> restarting the job, causing data reprocessing delays.
> >>>>> 2. Near-complete checkpoints (e.g., 95% persisted) are entirely
> >>>>> discarded on timeout — wasting all I/O work and potentially
> >>>>> creating a failure loop for large-state jobs.
> >>>>> 3. Static configuration cannot adapt to variable workloads at
> >>>>> runtime.
> >>>>>
> >>>>> FLIP-571 proposes a new REST API endpoint:
> >>>>>
> >>>>> PATCH /jobs/:jobid/checkpoints/configuration
> >>>>>
> >>>>> Key design points:
> >>>>>
> >>>>> - Timeout changes apply immediately to in-flight checkpoints by
> >>>>> rescheduling their canceller timers, saving near-complete
> >>>>> checkpoints from being discarded.
> >>>>> - Interval changes take effect on the next checkpoint trigger
> >>>>> cycle.
> >>>>> - Configuration overrides are persisted to ExecutionPlanStore
> >>>>> (following the JobResourceRequirements pattern) and automatically
> >>>>> restored after failover.
> >>>>>
> >>>>> For more details, please refer to the FLIP [1].
> >>>>>
> >>>>> Looking forward to your feedback and suggestions!
> >>>>>
> >>>>> [1]
> >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> >>>>>
> >>>>> Best regards,
> >>>>> Jiangang Liu
> >>>>
> >>>
> >>
> >
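P.S. For anyone following the thread-safety subthread above: the locking
discipline being described can be sketched roughly as below. This is a
minimal illustration with made-up class and method names, not Flink's
actual CheckpointCoordinator code. The point is that both volatile fields
are written inside a single synchronized block, so any reader taking the
same lock observes the (interval, timeout) pair atomically, while
unsynchronized readers (metrics, logging) still get visibility of each
field individually.

```java
/**
 * Minimal sketch of the volatile-plus-lock discipline discussed in the
 * thread. Hypothetical names; Flink's real CheckpointCoordinator differs.
 */
public class CheckpointSettingsSketch {

    private final Object lock = new Object();

    // volatile: unsynchronized reads (metrics, logging) always see a
    // recently written value of each individual field
    private volatile long checkpointInterval;
    private volatile long checkpointTimeout;

    public CheckpointSettingsSketch(long intervalMs, long timeoutMs) {
        this.checkpointInterval = intervalMs;
        this.checkpointTimeout = timeoutMs;
    }

    /** PATCH-style update: both fields change under one lock. */
    public void updateSettings(long newIntervalMs, long newTimeoutMs) {
        synchronized (lock) {
            this.checkpointInterval = newIntervalMs;
            this.checkpointTimeout = newTimeoutMs;
        }
    }

    /**
     * Scheduling/canceller logic reads both values under the same lock,
     * so it can never observe a torn (interval, timeout) pair.
     */
    public long[] readSettingsConsistently() {
        synchronized (lock) {
            return new long[] {checkpointInterval, checkpointTimeout};
        }
    }

    /**
     * Lock-free read, e.g. for a metrics gauge; it may race with an
     * in-progress update but never sees a stale cached value.
     */
    public long timeoutForMetrics() {
        return checkpointTimeout;
    }
}
```

Under this sketch, the proposed REST handler would reduce to a single
updateSettings(...) call forwarded to the JobMaster main thread.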
