Hey everyone, I would like to start a discussion about FLIP-547: Support checkpoint during recovery [1].

Currently, when a Flink job recovers from an unaligned checkpoint, it cannot trigger a new checkpoint until the entire recovery process is complete. For state-heavy or computationally intensive jobs, this recovery phase can be very slow, sometimes lasting for hours.

This limitation introduces significant challenges. It can block upstream and downstream systems, and any interruption (like another failure or a rescaling event) during this long recovery period causes the job to lose all progress and revert to the last successful checkpoint. This severely impacts the reliability and operational efficiency of long-running, large-scale jobs.

This proposal aims to solve these problems by allowing checkpoints to be taken *during* the recovery phase. This would allow a job to periodically save its restored progress, making the recovery process itself fault-tolerant. Adopting this feature will make Flink more robust, improve reliability for demanding workloads, and strengthen processing guarantees like exactly-once semantics.

Looking forward to feedback!

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery

Best,
Rui