Hi,

Thanks! +1 from me as well for working on this. That's one of the biggest
remaining feature gaps in Flink's reliable checkpointing story.
Also +1 for the approach proposed in this FLIP.

Best,
Piotrek

Mon, 22 Sep 2025 at 10:10 Rui Fan <[email protected]> wrote:

> Thanks everyone for the feedback!
>
> I will start the vote tomorrow if there are no more comments, thanks.
>
> Best,
> Rui
>
> On Wed, Sep 17, 2025 at 8:09 AM Zakelly Lan <[email protected]> wrote:
>
> > Hi Rui,
> >
> > It's a nice addition, and +1 for this optimization. I read through the
> > design and have no questions.
> >
> > Thanks for driving this.
> >
> >
> > Best,
> > Zakelly
> >
> > On Tue, Sep 16, 2025 at 9:15 PM Gabor Somogyi <[email protected]> wrote:
> >
> > > I've played a bit with the two scenarios you mentioned and I agree with you.
> > > Namely, I also don't expect unmanageable additional disk requirements
> > > with this addition.
> > > If we later see something, we still have the option to add some limits.
> > >
> > > +1 from my side.
> > >
> > > BR,
> > > G
> > >
> > >
> > > On Fri, Sep 12, 2025 at 10:48 AM Rui Fan <[email protected]> wrote:
> > >
> > > > Hey Gabor, thanks for your attention and discussion!
> > > >
> > > > > We see that a couple of workloads already require heavy disk usage.
> > > > > Are there any numbers on what the additional spilling would mean when
> > > > > buffers are exhausted?
> > > > > Some sort of ratio would also be good.
> > > >
> > > > My primary assessment is that the volume of "channel state" data being
> > > > spilled to disk should generally not be excessive. This is because this
> > > > state originates entirely from in-memory network buffers, and the total
> > > > available disk capacity is typically far greater than the total size of
> > > > these memory buffers.
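> > > >
> > > > For a rough sense of scale (the numbers are illustrative and assume
> > > > default memory settings): with the default taskmanager.memory.network.fraction
> > > > of 0.1, a TaskManager configured with 4 GB of Flink memory holds on the
> > > > order of a few hundred MB of network buffers, while local disks usually
> > > > offer tens or hundreds of GB.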
> > > >
> > > > As I see it, there are two main scenarios that could trigger spilling:
> > > >
> > > > Scenario 1: Scaling Down Parallelism
> > > >
> > > > For example, suppose parallelism is reduced from 100 to 1. The old job
> > > > (with 100 instances) might have a large amount of state held in its
> > > > network buffers. The new, scaled-down job (with 1 instance) has
> > > > significantly less memory allocated for network buffers, which could
> > > > be insufficient to hold the state during recovery, thus causing a spill.
> > > >
> > > > However, I believe this scenario is unlikely in practice. A large amount
> > > > of channel state (snapshotted by an unaligned checkpoint) usually
> > > > indicates high backpressure, and the correct operational response
> > > > would be to scale up, not down. Scaling up would provide more network
> > > > buffer memory, which would prevent, rather than cause, spilling.
> > > >
> > > > Scenario 2: All recovered buffers are restored on the input side
> > > >
> > > > This is a more plausible scenario. Even if the parallelism is unchanged,
> > > > a task's input buffer pool might need to accommodate both its own
> > > > recovered input state and the recovered output state from upstream
> > > > tasks. The combined size of this data could exceed the input pool's
> > > > capacity and trigger spilling.
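> > > >
> > > > To make the trigger concrete, here is a minimal sketch of that sizing
> > > > check (illustrative only; these are not Flink's actual classes or
> > > > method names):
> > > >
> > > >     // Hypothetical helper, not part of Flink's recovery code.
> > > >     final class SpillDecisionSketch {
> > > >
> > > >         /**
> > > >          * Returns how many bytes of recovered channel state would not
> > > >          * fit into the task's input buffer pool and would therefore
> > > >          * have to be spilled to local disk.
> > > >          */
> > > >         static long bytesToSpill(
> > > >                 long recoveredInputStateBytes,
> > > >                 long recoveredUpstreamOutputStateBytes,
> > > >                 int availableNetworkBuffers,
> > > >                 int bufferSizeBytes) {
> > > >             long recovered =
> > > >                 recoveredInputStateBytes + recoveredUpstreamOutputStateBytes;
> > > >             long capacity = (long) availableNetworkBuffers * bufferSizeBytes;
> > > >             return Math.max(0L, recovered - capacity);
> > > >         }
> > > >     }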
> > > >
> > > > > Is it planned to opt for slower memory-only recovery after a declared
> > > > > maximum disk usage is exceeded? I can imagine situations where
> > > > > memory and disk fill up quickly, which will blow things up and stay in
> > > > > an infinite loop (huge state + rescale).
> > > >
> > > > Regarding your question about a fallback plan for when disk usage
> > > > exceeds its limit: currently, we do not have such a "slower"
> > > > memory-only plan in place.
> > > >
> > > > The main reason is consistent with the point above: we believe the risk
> > > > of filling the disk is manageable, as the disk capacity is generally
> > > > much larger than the potential volume of data from the in-memory
> > > > network buffers.
> > > >
> > > > However, I completely agree with your suggestion. Implementing such a
> > > > safety valve would be a valuable addition. We will monitor for related
> > > > issues, and if they arise, we'll prioritize this enhancement in a
> > > > future release.
> > > >
> > > > WDYT?
> > > >
> > > > Best,
> > > > Rui
> > > >
> > > > On Thu, Sep 11, 2025 at 11:07 PM Roman Khachatryan <[email protected]> wrote:
> > > >
> > > > > Hi Rui, thanks for driving this!
> > > > >
> > > > > This would be a very useful addition to the Unaligned Checkpoints.
> > > > >
> > > > > I have no comments on the proposal as we already discussed it offline.
> > > > > Looking forward to it being implemented and released!
> > > > >
> > > > > Regards,
> > > > > Roman
> > > > >
> > > > >
> > > > > On Thu, Sep 11, 2025 at 3:52 PM Gabor Somogyi <[email protected]> wrote:
> > > > >
> > > > > > Hi Rui,
> > > > > >
> > > > > > The proposal describes the problem and the plan in a detailed way,
> > > > > > +1 on addressing this. I have a couple of questions:
> > > > > > - We see that a couple of workloads already require heavy disk usage.
> > > > > > Are there any numbers on what the additional spilling would mean when
> > > > > > buffers are exhausted?
> > > > > > Some sort of ratio would also be good.
> > > > > > - Is it planned to opt for slower memory-only recovery after a
> > > > > > declared maximum disk usage is exceeded? I can imagine situations
> > > > > > where memory and disk fill up quickly, which will blow things up and
> > > > > > stay in an infinite loop (huge state + rescale).
> > > > > >
> > > > > > BR,
> > > > > > G
> > > > > >
> > > > > >
> > > > > > On Thu, Sep 11, 2025 at 12:34 PM Rui Fan <[email protected]> wrote:
> > > > > >
> > > > > > > Hey everyone,
> > > > > > >
> > > > > > > I would like to start a discussion about FLIP-547: Support
> > > > > > > checkpoint during recovery [1].
> > > > > > >
> > > > > > > Currently, when a Flink job recovers from an unaligned checkpoint,
> > > > > > > it cannot trigger a new checkpoint until the entire recovery
> > > > > > > process is complete. For state-heavy or computationally intensive
> > > > > > > jobs, this recovery phase can be very slow, sometimes lasting for
> > > > > > > hours.
> > > > > > >
> > > > > > > This limitation introduces significant challenges. It can block
> > > > > > > upstream and downstream systems, and any interruption (like
> > > > > > > another failure or a rescaling event) during this long recovery
> > > > > > > period causes the job to lose all progress and revert to the last
> > > > > > > successful checkpoint. This severely impacts the reliability and
> > > > > > > operational efficiency of long-running, large-scale jobs.
> > > > > > >
> > > > > > > This proposal aims to solve these problems by allowing checkpoints
> > > > > > > to be taken *during* the recovery phase. A job could then
> > > > > > > periodically save its restored progress, making the recovery
> > > > > > > process itself fault-tolerant. Adopting this feature will make
> > > > > > > Flink more robust, improve reliability for demanding workloads,
> > > > > > > and strengthen processing guarantees like exactly-once semantics.
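> > > > > > >
> > > > > > > For context, this is roughly how unaligned checkpoints (the kind
> > > > > > > of checkpoint this recovery path starts from) are enabled today;
> > > > > > > the interval is only an example value:
> > > > > > >
> > > > > > >     import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
> > > > > > >
> > > > > > >     public class UnalignedCheckpointExample {
> > > > > > >         public static void main(String[] args) {
> > > > > > >             StreamExecutionEnvironment env =
> > > > > > >                 StreamExecutionEnvironment.getExecutionEnvironment();
> > > > > > >
> > > > > > >             // Checkpoint every 60s; the interval is only an example.
> > > > > > >             env.enableCheckpointing(60_000L);
> > > > > > >
> > > > > > >             // Unaligned checkpoints also snapshot in-flight buffers
> > > > > > >             // ("channel state"), which is the state this FLIP wants
> > > > > > >             // to be able to checkpoint again during recovery.
> > > > > > >             env.getCheckpointConfig().enableUnalignedCheckpoints();
> > > > > > >
> > > > > > >             // A real job would define sources, operators, and sinks
> > > > > > >             // here and then call env.execute().
> > > > > > >         }
> > > > > > >     }
> > > > > > >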
> > > > > > > Looking forward to feedback!
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-547%3A+Support+checkpoint+during+recovery
> > > > > > >
> > > > > > > Best,
> > > > > > > Rui
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
