Hi Jiangang Liu, thanks for the proposal!
I have some questions about it:
1. Observability
IIUC, updating CheckpointCoordinator state doesn't update any config
objects and is therefore not reflected in the Web UI / REST API. That would
be very confusing for users.
Implementing GET on the new endpoint would partially mitigate this; but
then we need to think about consistency between GET and PATCH.
2. Validation and configuration adjustment
Currently, when a job is started, checkpointing configuration is built and
validated as a whole.
For example, checking feature compatibility with Unaligned Checkpoints, or
something like
if (checkpointInterval < minPauseBetweenCheckpoints) {
    checkpointInterval = minPauseBetweenCheckpoints;
}
With a partial configuration update, we would skip such validation/adjustment
logic. We could solve this by calling the validation from the update path,
but that's error-prone and adds complexity.
Furthermore, we would need to enforce that logic on recovery as well (which
is doable, but again adds complexity).
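To make the concern concrete, here is a hypothetical sketch (illustrative names, not actual Flink code) of how the adjustment rule above could be factored into one method shared by the startup path, the dynamic-update path, and recovery, so the three can't drift apart:

```java
// Hypothetical sketch: centralizing checkpoint-config normalization so that
// job start, dynamic updates, and recovery all apply the same rules.
// Class and method names are illustrative, not actual Flink classes.
final class CheckpointConfigValidator {

    /** Applies the same adjustment rule regardless of where the config comes from. */
    static long normalizeInterval(long checkpointInterval, long minPauseBetweenCheckpoints) {
        if (checkpointInterval < minPauseBetweenCheckpoints) {
            // Same rule as on job start: the interval must not undercut the min pause.
            return minPauseBetweenCheckpoints;
        }
        return checkpointInterval;
    }
}
```

Even with such a shared helper, every new mutable parameter has to remember to call it, which is the error-proneness I mean.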
3. Relation to FLIP-309
Will we need any special handling when processing backlog?
4. Besides that, FLIP-309 left config fields final in CheckpointCoordinator
but introduced some complexity to it.
If we're to make these fields non-final, it might be a good opportunity
to get rid of that complexity.
5. Would the new API allow turning checkpointing on/off completely? (e.g.
for debugging or catching up with a backlog)
6. Volatility
> Phase 1 focuses on two parameters: checkpointInterval and
> checkpointTimeout. Both fields in CheckpointCoordinator change from final
> to volatile.
Why do we need to mark those fields volatile? Isn't the following enough?
> All operations are within the existing synchronized(lock) block
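For reference, my understanding of the memory-model point, as a generic sketch (not Flink code; names are illustrative): if every read and write of a field happens inside synchronized blocks on the same lock, the monitor's happens-before edges already guarantee visibility, and volatile adds nothing.

```java
// Generic JMM illustration: a field guarded consistently by one lock does
// not need to be volatile. Names are illustrative only.
final class GuardedConfig {
    private final Object lock = new Object();
    private long checkpointTimeout; // no 'volatile' needed

    void setCheckpointTimeout(long millis) {
        synchronized (lock) {
            checkpointTimeout = millis; // write under the lock
        }
    }

    long getCheckpointTimeout() {
        synchronized (lock) {
            return checkpointTimeout; // read under the same lock: visibility guaranteed
        }
    }
}
```

Volatile would only be needed if some path reads the fields outside the lock.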
7. API conflicts
When can we get "Concurrent API calls: Second request rejected with 409"
situation?
8. API path
When thinking about the API as a whole, placing the new endpoint under
"/jobs/{job_id}/config/checkpointing" seems clearer to me.
9. Given the complexity this proposal adds to the CheckpointCoordinator,
and concerns 1 and 2 above,
I wonder whether it's actually worth it, given FLIP-530.
The proposal mentions the following advantages over FLIP-530:
1. No job restarts. From my experience, restarts without rescaling are
usually fast because they rely on Local Recovery.
2. Allowing checkpoints to complete. This can be achieved more generically
by letting the Scheduler "cool down" after receiving a configuration
update, similar to receiving ResourceRequirements.
3. Works with the DefaultScheduler. That's true; however, the
AdaptiveScheduler is planned to become the default one.
From my perspective, the remaining advantages don't justify the added
complexity.
Or maybe we can start with FLIP-530 and add FLIP-571 on top if that's not
enough. That would also simplify coordination (I had a lot of conflicts
while working on the internal version of FLIP-530).
AFAIK, the work on the public version of FLIP-530 has already started.
cc: @Matthias Pohl <[email protected]>, Anton, Zsombor
Regards,
Roman
On Wed, Apr 8, 2026 at 10:52 AM zhao_abc_123 <[email protected]> wrote:
> Thanks for the FLIP. This ability is needed in production. I would like
> to add a suggestion: dynamically adjusting these configuration parameters
> would also be helpful:
> `execution.checkpointing.max-concurrent-checkpoints`,
> `execution.checkpointing.min-pause`,
> `execution.checkpointing.tolerable-failed-checkpoints`.
>
>
> Best
> xingsuo-zbz
>
> At 2026-04-08 15:51:08, "熊饶饶" <[email protected]> wrote:
> >Thanks for the FLIP. It is useful for users. I have only one question:
> could JobManager memory pressure under high-concurrency sampling cause an
> OOM in large-scale jobs?
> >
> >> On Mar 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
> >>
> >> Hi everyone,
> >>
> >> I would like to start a discussion on FLIP-571: Support Dynamically
> >> Updating Checkpoint Configuration at Runtime via REST API [1].
> >>
> >> Currently, checkpoint configuration (checkpointInterval,
> checkpointTimeout)
> >> is immutable after job submission. This creates significant operational
> >> challenges for long-running streaming jobs:
> >>
> >> 1. Cascading checkpoint failures cannot be resolved without restarting
> >> the
> >> job, causing data reprocessing delays.
> >> 2. Near-complete checkpoints (e.g., 95% persisted) are entirely
> discarded
> >> on timeout — wasting all I/O work and potentially creating a failure
> >> loop for large-state jobs.
> >> 3. Static configuration cannot adapt to variable workloads at runtime.
> >>
> >> FLIP-571 proposes a new REST API endpoint:
> >>
> >> PATCH /jobs/:jobid/checkpoints/configuration
> >>
> >> Key design points:
> >>
> >> - Timeout changes apply immediately to in-flight checkpoints by
> >> rescheduling their canceller timers, saving near-complete checkpoints
> >> from being discarded.
> >> - Interval changes take effect on the next checkpoint trigger cycle.
> >> - Configuration overrides are persisted to ExecutionPlanStore
> (following
> >> the JobResourceRequirements pattern) and automatically restored after
> >> failover.
> >>
> >> For more details, please refer to the FLIP [1].
> >>
> >> Looking forward to your feedback and suggestions!
> >>
> >> [1]
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
> >>
> >> Best regards,
> >> Jiangang Liu
>