Hey everyone,

I would like to start a discussion about FLIP-547: Support checkpoint
during recovery [1].

Currently, when a Flink job recovers from an unaligned checkpoint, it
cannot trigger a new checkpoint until the entire recovery process is
complete. For state-heavy or computationally intensive jobs, this recovery
phase can be very slow, sometimes lasting for hours.

This limitation introduces significant challenges. It can block upstream
and downstream systems, and any interruption (like another failure or a
rescaling event) during this long recovery period causes the job to lose
all progress and revert to the last successful checkpoint. This severely
impacts the reliability and operational efficiency of long-running,
large-scale jobs.

This proposal aims to solve these problems by allowing checkpoints to be
taken *during* the recovery phase. This would allow a job to periodically
save its restored progress, making the recovery process itself
fault-tolerant. Adopting this feature will make Flink more robust, improve
reliability for demanding workloads, and strengthen processing guarantees
like exactly-once semantics.

Looking forward to feedback!

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery

Best,
Rui
