Hi Rui,

It's a nice addition, and +1 for this optimization. I read through the design and have no questions.
Thanks for driving this.

Best,
Zakelly

On Tue, Sep 16, 2025 at 9:15 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:

> I've played a bit with the mentioned 2 scenarios and I agree with you. Namely, I also don't expect unmanageable additional disk requirements with this addition. Later, if we see something, we still have the possibility to add some limits.
>
> +1 from my side.
>
> BR,
> G
>
> On Fri, Sep 12, 2025 at 10:48 AM Rui Fan <1996fan...@gmail.com> wrote:
>
> > Hey Gabor, thanks for your attention and discussion!
> >
> > > We see that a couple of workloads require heavy disk usage already. Are there any numbers on what additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
> >
> > My primary assessment is that the volume of "channel state" data being spilled to disk should generally not be excessive. This is because this state originates entirely from in-memory network buffers, and the total available disk capacity is typically far greater than the total size of these memory buffers.
> >
> > As I see it, there are two main scenarios that could trigger spilling:
> >
> > Scenario 1: Scaling down parallelism
> >
> > For example, parallelism is reduced from 100 to 1. The old job (with 100 instances) might have a large amount of state held in its network buffers. The new, scaled-down job (with 1 instance) has significantly less memory allocated for network buffers, which could be insufficient to hold the state during recovery, thus causing a spill.
> >
> > However, I believe this scenario is unlikely in practice. A large amount of channel state (which is snapshotted by unaligned checkpoints) usually indicates high backpressure, and the correct operational response would be to scale up, not down. Scaling up would provide more network buffer memory, which would prevent, rather than cause, spilling.
> >
> > Scenario 2: All recovered buffers are restored on the input side
> >
> > This is a more plausible scenario. Even if the parallelism is unchanged, a task's input buffer pool might need to accommodate both its own recovered input state and the recovered output state from upstream tasks. The combined size of this data could exceed the input pool's capacity and trigger spilling.
> >
> > > Is it planned to opt for slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which would blow things up and stay in an infinite loop (huge state + rescale).
> >
> > Regarding your question about a fallback plan for when disk usage exceeds its limit: currently, we do not have such a "slower" memory-only plan in place.
> >
> > The main reason is consistent with the point above: we believe the risk of filling the disk is manageable, as the disk capacity is generally much larger than the potential volume of data from the in-memory network buffers.
> >
> > However, I completely agree with your suggestion. Implementing such a safety valve would be a valuable addition. We will monitor for related issues, and if they arise, we'll prioritize this enhancement in the future.
> >
> > WDYT?
> >
> > Best,
> > Rui
> >
> > On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <ro...@apache.org> wrote:
> >
> > > Hi Rui, thanks for driving this!
> > >
> > > This would be a very useful addition to the Unaligned Checkpoints.
> > > I have no comments on the proposal, as we already discussed it offline. Looking forward to it being implemented and released!
> > >
> > > Regards,
> > > Roman
> > >
> > > On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
> > >
> > > > Hi Rui,
> > > >
> > > > The proposal describes the problem and plan in a detailed way, +1 on addressing this. I have a couple of questions:
> > > > - We see that a couple of workloads require heavy disk usage already. Are there any numbers on what additional spilling would mean when buffers are exhausted? Some sort of ratio would also be good.
> > > > - Is it planned to opt for slower memory-only recovery after a declared maximum disk usage is exceeded? I can imagine situations where memory and disk fill up quickly, which would blow things up and stay in an infinite loop (huge state + rescale).
> > > >
> > > > BR,
> > > > G
> > > >
> > > > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <1996fan...@gmail.com> wrote:
> > > >
> > > > > Hey everyone,
> > > > >
> > > > > I would like to start a discussion about FLIP-547: Support checkpoint during recovery [1].
> > > > >
> > > > > Currently, when a Flink job recovers from an unaligned checkpoint, it cannot trigger a new checkpoint until the entire recovery process is complete. For state-heavy or computationally intensive jobs, this recovery phase can be very slow, sometimes lasting for hours.
> > > > >
> > > > > This limitation introduces significant challenges. It can block upstream and downstream systems, and any interruption (like another failure or a rescaling event) during this long recovery period causes the job to lose all progress and revert to the last successful checkpoint. This severely impacts the reliability and operational efficiency of long-running, large-scale jobs.
> > > > >
> > > > > This proposal aims to solve these problems by allowing checkpoints to be taken *during* the recovery phase. This would allow a job to periodically save its restored progress, making the recovery process itself fault-tolerant. Adopting this feature will make Flink more robust, improve reliability for demanding workloads, and strengthen processing guarantees like exactly-once semantics.
> > > > >
> > > > > Looking forward to feedback!
> > > > >
> > > > > [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > > > >
> > > > > Best,
> > > > > Rui
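
For anyone following along who wants to reproduce the setup under discussion: below is a minimal, illustrative sketch of a job with unaligned checkpoints enabled via the DataStream API. This is only the existing unaligned-checkpoint feature whose channel state FLIP-547 is concerned with; nothing here is part of the FLIP itself. The interval and timeout values are arbitrary examples, and setAlignedCheckpointTimeout assumes a reasonably recent Flink release.

  import java.time.Duration;

  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class UnalignedCheckpointExample {
      public static void main(String[] args) throws Exception {
          StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

          // Periodic checkpoints every 30 seconds (exactly-once is the default mode).
          env.enableCheckpointing(30_000);

          // Unaligned checkpoints snapshot in-flight network buffers ("channel state"),
          // which is the data that may be spilled to disk during recovery per FLIP-547.
          env.getCheckpointConfig().enableUnalignedCheckpoints();

          // Only switch from aligned to unaligned barriers when alignment takes too long,
          // i.e. under backpressure.
          env.getCheckpointConfig().setAlignedCheckpointTimeout(Duration.ofSeconds(10));

          // A trivial pipeline, just so the job is runnable end to end.
          env.fromSequence(0, Long.MAX_VALUE).print();

          env.execute("unaligned-checkpoint-example");
      }
  }

The same setup can be expressed in the configuration file with the existing keys execution.checkpointing.interval and execution.checkpointing.unaligned, which is often more convenient for the long-running, large-scale jobs the FLIP targets.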