Hi Rui,

It's a nice addition, and +1 for this optimization. I read through the design and have no questions.
Thanks for driving this.

Best,
Zakelly

On Tue, Sep 16, 2025 at 9:15 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> I've played a bit with the mentioned 2 scenarios and I agree with you. Namely, I also don't expect unmanageable additional disk requirements with this addition. Later, if we see something, we still have the possibility to add some limits.
>
> +1 from my side.
>
> BR,
> G
>
> On Fri, Sep 12, 2025 at 10:48 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hey Gabor, thanks for your attention and discussion!
> >
> > > We see that a couple of workloads require heavy disk usage already. Are there any numbers on what additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
> >
> > My primary assessment is that the volume of "channel state" data being spilled to disk should generally not be excessive. This is because this state originates entirely from in-memory network buffers, and the total available disk capacity is typically far greater than the total size of these memory buffers.
> >
> > As I see it, there are two main scenarios that could trigger spilling:
> >
> > Scenario 1: Scaling down parallelism
> >
> > For example, parallelism is reduced from 100 to 1. The old job (with 100 instances) might have a large amount of state held in its network buffers. The new, scaled-down job (with 1 instance) has significantly less memory allocated for network buffers, which could be insufficient to hold the state during recovery, thus causing a spill.
> >
> > However, I believe this scenario is unlikely in practice. A large amount of channel state (which is snapshotted by unaligned checkpoints) usually indicates high backpressure, and the correct operational response would be to scale up, not down. Scaling up would provide more network buffer memory, which would prevent, rather than cause, spilling.
> >
> > Scenario 2: All recovered buffers are restored on the input side
> >
> > This is a more plausible scenario. Even if the parallelism is unchanged, a task's input buffer pool might need to accommodate both its own recovered input state and the recovered output state from upstream tasks. The combined size of this data could exceed the input pool's capacity and trigger spilling.
> >
> > > Is it planned to opt for slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which would blow things up and stay in an infinite loop (huge state + rescale).
> >
> > Regarding your question about a fallback plan for when disk usage exceeds its limit: currently, we do not have such a "slower" memory-only plan in place.
> >
> > The main reason is consistent with the point above: we believe the risk of filling the disk is manageable, as the disk capacity is generally much larger than the potential volume of data from the in-memory network buffers.
> >
> > However, I completely agree with your suggestion. Implementing such a safety valve would be a valuable addition. We will monitor for related issues, and if they arise, we'll prioritize this enhancement in the future.
> >
> > WDYT?
> >
> > Best,
> > Rui
> >
> > On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <ro...@apache.org> wrote:
> >
> > > Hi Rui, thanks for driving this!
> > >
> > > This would be a very useful addition to the Unaligned Checkpoints.
> > > I have no comments on the proposal, as we already discussed it offline. Looking forward to it being implemented and released!
> > >
> > > Regards,
> > > Roman
> > >
> > > On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > >
> > > > Hi Rui,
> > > >
> > > > The proposal describes the problem and plan in a detailed way, +1 on addressing this. I have a couple of questions:
> > > > - We see that a couple of workloads require heavy disk usage already. Are there any numbers on what additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
> > > > - Is it planned to opt for slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which would blow things up and stay in an infinite loop (huge state + rescale).
> > > >
> > > > BR,
> > > > G
> > > >
> > > > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <1996fan...@gmail.com> wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > I would like to start a discussion about FLIP-547: Support checkpoint during recovery [1].
> > > > >
> > > > > Currently, when a Flink job recovers from an unaligned checkpoint, it cannot trigger a new checkpoint until the entire recovery process is complete. For state-heavy or computationally intensive jobs, this recovery phase can be very slow, sometimes lasting for hours.
> > > > >
> > > > > This limitation introduces significant challenges. It can block upstream and downstream systems, and any interruption (like another failure or a rescaling event) during this long recovery period causes the job to lose all progress and revert to the last successful checkpoint. This severely impacts the reliability and operational efficiency of long-running, large-scale jobs.
> > > > >
> > > > > This proposal aims to solve these problems by allowing checkpoints to be taken *during* the recovery phase. This would allow a job to periodically save its restored progress, making the recovery process itself fault-tolerant. Adopting this feature will make Flink more robust, improve reliability for demanding workloads, and strengthen processing guarantees like exactly-once semantics.
> > > > >
> > > > > Looking forward to feedback!
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > > > >
> > > > > Best,
> > > > > Rui
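
For anyone following along who wants to reproduce the setup under discussion: below is a minimal, illustrative sketch of a job with unaligned checkpoints enabled via the DataStream API. This is only the existing unaligned-checkpoint feature whose channel state FLIP-547 is concerned with; nothing here is part of the FLIP itself. The interval and timeout values are arbitrary examples, and setAlignedCheckpointTimeout assumes a reasonably recent Flink release.

  import java.time.Duration;

  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class UnalignedCheckpointExample {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          // Periodic checkpoints every 30 seconds (exactly-once is the default mode).
          env.enableCheckpointing(30_000);

          // Unaligned checkpoints snapshot in-flight network buffers ("channel state"),
          // which is the data that may be spilled to disk during recovery per FLIP-547.
          env.getCheckpointConfig().enableUnalignedCheckpoints();

          // Only switch from aligned to unaligned barriers when alignment takes too long,
          // i.e. under backpressure.
          env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(10));

          // A trivial pipeline, just so the job is runnable end to end.
          env.fromSequence(0, Long.MAX_VALUE).print();

          env.execute("unaligned-checkpoint-example");
      }
  }

The same setup can be expressed in the configuration file with the existing keys execution.checkpointing.interval and execution.checkpointing.unaligned, which is often more convenient for the long-running, large-scale jobs the FLIP targets.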