Thanks for the FLIP; this will be useful for users. I have only one question, about JobManager (JM) memory pressure under high-concurrency sampling: could it cause OOM in large-scale jobs?

> On March 24, 2026, at 16:29, Jiangang Liu <[email protected]> wrote:
>
> Hi everyone,
>
> I would like to start a discussion on FLIP-571: Support Dynamically
> Updating Checkpoint Configuration at Runtime via REST API [1].
>
> Currently, checkpoint configuration (checkpointInterval, checkpointTimeout)
> is immutable after job submission. This creates significant operational
> challenges for long-running streaming jobs:
>
> 1. Cascading checkpoint failures cannot be resolved without restarting the
>    job, causing data reprocessing delays.
> 2. Near-complete checkpoints (e.g., 95% persisted) are entirely discarded
>    on timeout, wasting all I/O work and potentially creating a failure
>    loop for large-state jobs.
> 3. Static configuration cannot adapt to variable workloads at runtime.
>
> FLIP-571 proposes a new REST API endpoint:
>
>     PATCH /jobs/:jobid/checkpoints/configuration
>
> Key design points:
>
> - Timeout changes apply immediately to in-flight checkpoints by
>   rescheduling their canceller timers, saving near-complete checkpoints
>   from being discarded.
> - Interval changes take effect on the next checkpoint trigger cycle.
> - Configuration overrides are persisted to ExecutionPlanStore (following
>   the JobResourceRequirements pattern) and automatically restored after
>   failover.
>
> For more details, please refer to the FLIP [1].
>
> Looking forward to your feedback and suggestions!
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API
>
> Best regards,
> Jiangang Liu
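
For concreteness, here is a minimal sketch of how a client might build a request for the proposed PATCH endpoint. Only the path comes from the proposal. The JSON field names (checkpointInterval, checkpointTimeout, reused from the configuration names mentioned above), the millisecond units, the host and port, and the job id are all assumptions for illustration; the FLIP itself defines the actual request schema.

```python
import json
import urllib.request


def build_checkpoint_config_patch(base_url, job_id, interval_ms=None, timeout_ms=None):
    """Build (but do not send) a PATCH request for the proposed endpoint.

    The body field names and millisecond units are assumptions, not taken
    from the FLIP.
    """
    body = {}
    if interval_ms is not None:
        body["checkpointInterval"] = interval_ms  # assumed field name
    if timeout_ms is not None:
        body["checkpointTimeout"] = timeout_ms  # assumed field name
    if not body:
        raise ValueError("at least one of interval_ms or timeout_ms is required")

    # Path as given in the proposal: /jobs/:jobid/checkpoints/configuration
    url = f"{base_url.rstrip('/')}/jobs/{job_id}/checkpoints/configuration"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


# Hypothetical host and job id: raise only the timeout to 20 minutes.
req = build_checkpoint_config_patch(
    "http://localhost:8081", "a1b2c3", timeout_ms=20 * 60 * 1000
)
```

Passing the request to urllib.request.urlopen(req) would send it once such an endpoint exists. Updating only the timeout matches the case described in the proposal, where an in-flight checkpoint that is nearly complete is saved by extending its deadline.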
