Hi everyone,

I would like to start a discussion on FLIP-571: Support Dynamically
Updating Checkpoint Configuration at Runtime via REST API [1].

Currently, checkpoint configuration (checkpointInterval, checkpointTimeout)
is immutable after job submission. This creates significant operational
challenges for long-running streaming jobs:

   1. Cascading checkpoint failures cannot be resolved without restarting
   the job, causing data reprocessing delays.
   2. Near-complete checkpoints (e.g., 95% persisted) are entirely discarded
   on timeout, wasting all the I/O work already done and potentially
   creating a failure loop for large-state jobs.
   3. Static configuration cannot adapt to variable workloads at runtime.

FLIP-571 proposes a new REST API endpoint:

PATCH /jobs/:jobid/checkpoints/configuration
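
To make the shape of a call concrete, here is a rough sketch of how a client
might build such a request with the JDK's HttpClient API. The JSON field names
are only assumed from the option names mentioned above, and the REST address
and job ID are placeholders; the actual request schema is whatever the FLIP
ends up defining.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class PatchCheckpointConfigExample {

    // Builds (but does not send) a PATCH request for the proposed endpoint.
    static HttpRequest buildRequest(String restBase, String jobId,
                                    long intervalMs, long timeoutMs) {
        // ASSUMPTION: field names mirror the option names in this email;
        // they are not taken from the FLIP's request schema.
        String body = String.format(
                "{\"checkpointInterval\": %d, \"checkpointTimeout\": %d}",
                intervalMs, timeoutMs);
        return HttpRequest.newBuilder()
                .uri(URI.create(restBase + "/jobs/" + jobId
                        + "/checkpoints/configuration"))
                .header("Content-Type", "application/json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder REST address and job ID, for illustration only.
        HttpRequest req = buildRequest(
                "http://localhost:8081", "a1b2c3", 60_000L, 1_800_000L);
        System.out.println(req.method() + " " + req.uri());
    }
}
```

Sending it would be a matter of passing the request to HttpClient.send(...)
against a running cluster that has the endpoint.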

Key design points:

   - Timeout changes apply immediately to in-flight checkpoints by
   rescheduling their canceller timers, saving near-complete checkpoints
   from being discarded.
   - Interval changes take effect on the next checkpoint trigger cycle.
   - Configuration overrides are persisted to ExecutionPlanStore (following
   the JobResourceRequirements pattern) and automatically restored after
   failover.
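
As an illustration of the first point, rescheduling a canceller timer for an
in-flight checkpoint could look roughly like the sketch below. It is a
standalone toy built on ScheduledExecutorService, not the coordinator code
from the FLIP; the class and method names are hypothetical, and it assumes the
new timeout is measured from the checkpoint's original start time.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a pending checkpoint with a canceller timer.
public class CancellerRescheduleSketch {
    private final ScheduledExecutorService timer;
    private final long startNanos = System.nanoTime();
    private ScheduledFuture<?> canceller;
    private long generation = 0;       // identifies the currently valid timer
    private boolean expired = false;

    CancellerRescheduleSketch(ScheduledExecutorService timer, long timeoutMs) {
        this.timer = timer;
        synchronized (this) {
            schedule(timeoutMs);
        }
    }

    // Caller must hold the lock. Arms a new timer tagged with a fresh generation.
    private void schedule(long delayMs) {
        long gen = ++generation;
        canceller = timer.schedule(() -> expire(gen), delayMs, TimeUnit.MILLISECONDS);
    }

    // Only the most recently scheduled timer is allowed to expire the checkpoint.
    private synchronized void expire(long gen) {
        if (gen == generation) {
            expired = true;
        }
    }

    // Replaces the timer so the checkpoint expires newTimeoutMs after its start.
    synchronized boolean updateTimeout(long newTimeoutMs) {
        if (expired) {
            return false;              // too late, already timed out
        }
        canceller.cancel(false);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
        schedule(Math.max(0L, newTimeoutMs - elapsedMs));
        return true;
    }

    synchronized boolean isExpired() {
        return expired;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        CancellerRescheduleSketch cp = new CancellerRescheduleSketch(timer, 200L);
        cp.updateTimeout(2_000L);      // extend before the original 200 ms elapse
        Thread.sleep(500L);
        System.out.println("expired after 500 ms: " + cp.isExpired());
        timer.shutdownNow();
    }
}
```

The generation check is there so that a stale timer which fires while an
update is in progress cannot expire the checkpoint after its timeout has
been extended.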

For more details, please refer to the FLIP [1].

Looking forward to your feedback and suggestions!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-571%3A+Support+Dynamically+Updating+Checkpoint+Configuration+at+Runtime+via+REST+API

Best regards,
Jiangang Liu
