Re: [DISCUSS] FLIP-XXX: Independent Checkpoint Based On Pipeline Region

Roman Khachatryan Fri, 19 Jun 2026 07:53:07 -0700

Hi, thanks for your replies and sorry for the delay.

Most of my questions were answered, but I still have some concerns.


> If there are no further concerns by next Monday (June 22), I'll go ahead
and start the [VOTE] thread for this FLIP.

Isn't the actual FLIP still missing? I only saw the Google Doc. Would you
mind creating a wiki page as described in [1]?

----------------------------------------

> 3. Checkpoint metadata layout
> Regional Checkpoint recombines state from different checkpoint IDs. To
track this, we add a refCheckpointId field to OperatorSubtaskState in the
metadata, indicating which historical checkpoint a subtask’s state
references.

Could you explain how we find the right OperatorSubtaskState,
especially in case of rescaling?
Does the proposal support rescaling?
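
To make the question concrete, here is a minimal sketch of how I picture
the reference tracking. It is not taken from the FLIP document: the class
name, the method name and the map keyed by subtask index are all my
assumptions, and I also assume that a subtask with fresh state references
the current checkpoint ID.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of Flink or of the FLIP. */
final class ReferencedCheckpoints {

    /**
     * Returns the checkpoint IDs that must stay alive for the given checkpoint.
     * Assumption: refCheckpointIdBySubtask maps a subtask index to the
     * checkpoint ID whose state that subtask references.
     */
    static Set<Long> retained(long checkpointId, Map<Integer, Long> refCheckpointIdBySubtask) {
        Set<Long> ids = new TreeSet<>();
        ids.add(checkpointId);                          // the checkpoint itself
        ids.addAll(refCheckpointIdBySubtask.values());  // plus every referenced historical one
        return ids;
    }
}
```

A lookup keyed by subtask index, as in this sketch, stops being meaningful
once the parallelism changes, which is why I'm asking about rescaling.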

> 9. Finished operators
> The concern is: a finished operator’s final commit notification gets
skipped by Regional Checkpoint, and if this checkpoint is the last one, the
operator never receives it — could this cause data loss?
> In practice, the impact is limited:
> ● Failed Region tasks are already gone: By the time the Regional
Checkpoint completes, tasks in the failed Region have already been
restarted (decline) or cancelled (timeout). There is no task left to
receive the notification anyway.
Checkpoint failure doesn't necessarily cause a restart (especially if this
is limited to one region). The tasks should still be up and running.

> ● maxConsecutiveFailures guarantees a global checkpoint: After reaching
the limit, the next checkpoint is forced to be global, ensuring all tasks
eventually receive notifyCheckpointComplete. We can’t skip the same Region
forever.
maxConsecutiveFailures might not be reached for the final checkpoint.

> ● stop-with-savepoint bypasses Regional Checkpoint: When the user stops
the job gracefully, it triggers a full global snapshot, not a Regional
Checkpoint. So the final checkpoint is always complete.
stop-with-savepoint should be fine, yes.

To clarify, my concern is about jobs with bounded sources. In such cases,
some subtasks might finish processing but still participate in checkpoints.
After a successful checkpoint, they are guaranteed to get a checkpoint
completion notification, so that they can make side effects visible in
external systems (e.g. commit Kafka transactions).
See FLIP-147 [2].

However, with the current proposal, the job might complete with some
subtasks/regions failing the final checkpoint unless I'm missing something.
This is essentially data loss.
To prevent this, the final checkpoint must always be acked by all
subtasks/regions.
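
As a rough illustration of the rule I have in mind (a sketch only; the
class, the enum and the use of plain strings as region IDs are hypothetical
and not from the FLIP):

```java
import java.util.Set;

/** Hypothetical decision rule, not part of Flink or of the FLIP. */
final class FinalCheckpointRule {

    enum Decision { COMPLETE_GLOBAL, COMPLETE_REGIONAL, ABORT }

    /**
     * Decides what to do once acks are collected.
     * Assumption: regions are identified by opaque string IDs.
     */
    static Decision decide(boolean finalCheckpoint, Set<String> allRegions, Set<String> ackedRegions) {
        if (ackedRegions.containsAll(allRegions)) {
            return Decision.COMPLETE_GLOBAL;      // every region acked: a normal checkpoint
        }
        // Some region is missing. Falling back to historical state is only
        // acceptable if a later checkpoint can still deliver the notification.
        return finalCheckpoint ? Decision.ABORT : Decision.COMPLETE_REGIONAL;
    }
}
```

With this rule, a final checkpoint that is missing an ack from any region
is aborted rather than completed regionally.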

----------------------------------------

There are quite a few limitations in this proposal.
Could you add a section describing how each of them is handled, i.e. which
of the following applies?
1. Reject job submission
2. Force all-region checkpoint
3. Warn in documentation

> 1. Region independence — BLOCKING/HYBRID edges
> You’re right. Our current scope is limited to embarrassingly parallel
regions. In typical ETL scenarios, each parallelism maps to an independent
Region with no edges connecting them.

> 5. SharedStateRegistry — how are old states kept alive?
> Good question. In the current design, since we only target embarrassingly
parallel regions, there is typically no keyed state and no incremental
state. As a result, the SharedStateRegistry is generally empty (setting
aside File Merging and Changelog State for now, discussed on 8.), so
keep-alive of files under the shared directory is not a concern.

> 8. FLINK-26803 and FLIP-306 compatibility
> This is a very important point. Both features essentially merge small
files at the job level. As Rui Fan pointed out, if the merging granularity
is reduced to the Region level, compatibility with Regional Checkpoint
should be achievable in theory. I think this can be deferred to future work
— once FLINK-26803 is consolidated into FLIP-306, we can revisit and enable
support.

> 10. NO_CLAIM mode warning
> You’re absolutely right — this is an important reminder. After restoring
from a Regional Checkpoint, only a successful global checkpoint guarantees
independence from the old state. We’ll add a clear user warning in the
documentation.

> 11. Changelog state backend — not supported
> As mentioned earlier, our primary target is embarrassingly parallel
regions, which typically have no keyed state and therefore no slow
incremental state flush issues. I don’t think we need to support Changelog
state backend for now.

----------------------------------------

> 2. max-consecutive-failures exceeded — what exactly happens?
> The current design says “force a global checkpoint.” To clarify the
two-tier behavior:
> ● Tier 1: When consecutiveRegionalCount >= maxConsecutiveFailures, the
next checkpoint is forced to be global.
> ● Tier 2: If that forced global checkpoint also fails (any task
declines), the checkpoint is aborted normally (not a job failure). The
counter is then reset since a global checkpoint was attempted, and the next
checkpoint cycle can try again.
> This avoids cascading into job failure while ensuring we don’t drift
indefinitely on historical state.

My assumption was that we would not allow this particular failed region to
fail the checkpoint again.
But forcing a global checkpoint works as well.
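
For reference, this is how I read the two-tier behavior quoted above, as a
minimal sketch. The field names consecutiveRegionalCount and
maxConsecutiveFailures come from the quoted text; the class and method
names are hypothetical.

```java
/** Hypothetical policy object, not part of Flink or of the FLIP. */
final class RegionalCheckpointPolicy {

    private final int maxConsecutiveFailures;
    private int consecutiveRegionalCount = 0;

    RegionalCheckpointPolicy(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    /** Tier 1: once the limit is reached, the next checkpoint must be global. */
    boolean nextCheckpointMustBeGlobal() {
        return consecutiveRegionalCount >= maxConsecutiveFailures;
    }

    /** A checkpoint completed by reusing historical state for some region. */
    void onRegionalCheckpointCompleted() {
        consecutiveRegionalCount++;
    }

    /**
     * Tier 2: the counter is reset whenever a global checkpoint was attempted,
     * whether it completed or was aborted, so the job is never failed here.
     */
    void onGlobalCheckpointAttempted() {
        consecutiveRegionalCount = 0;
    }
}
```

Note that in this reading a failed forced global checkpoint also resets the
counter, so the next cycle is allowed to be regional again.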

> 6. Checkpoint abort notifications & Local Recovery cleanup — new
notification type
> This is a very insightful point. Zihao and Gen also raised this in
earlier discussions. The current design doesn’t address state cleanup for
tasks in failed regions. I agree it’s necessary to introduce a new
notification type. For tasks in failed regions, local state cleanup can be
deferred until the next checkpoint trigger.
OK, this can be future work.

> 7. Task that never acknowledges nor declines — per-region timeouts
> This was discussed in the previous thread. Network issues may cause a
task to neither ack nor decline in time. In such cases, we treat it as a
checkpoint timeout: the affected tasks’ region is marked as failed, and the
process ultimately falls through to the normal Regional Checkpoint
processing logic.
OK, this can be future work.

----------------------------------------

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65145551#FlinkImprovementProposals-CreateyourOwnFLIP

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-147%3A+Support+Checkpoints+After+Tasks+Finished

Regards,
Roman



On Wed, Jun 17, 2026 at 9:56 AM 熊饶饶 <[email protected]> wrote:

> Hi all,
>
> Thanks everyone for the valuable feedback. I believe all the points raised
> above have been addressed (@Roman @Rui Fan). If there are no further
> concerns by next Monday (June 22), I'll go ahead and start the [VOTE]
> thread for this FLIP.
>
> For reference, the earlier related discussion can be found here:
> https://lists.apache.org/thread/qpztk0jdpcmhomszjx63l53xv26xnmwf
>
>
> Please feel free to share any additional feedback before then.
>
> Best Regards,
> Raorao
>
> On May 27, 2026, at 16:31, 熊饶饶 <[email protected]> wrote:
>
> Hi devs,
>
> I would like to start a discussion on FLIP-XXX: Independent Checkpoint
> Based On Pipeline Region.
>
> In high-parallelism streaming jobs, a single Task's checkpoint failure
> causes the entire global Checkpoint to abort, leading to degraded
> checkpoint success rates and wasted compute resources (especially for GPU
> operators).
>
> We propose Regional Checkpoint: when some Regions fail to checkpoint, the
> framework combines the historical state of the failed Regions with the
> current state of the healthy Regions to produce a logically complete
> Completed Checkpoint — while preserving state consistency. The key changes
> are:
>
> 1. Snapshot Collection — Allow partial region failures; combine last
> successful state of failed Regions with current state of normal Regions.
>
> 2. State Correction — New checkpointCoordinatorForRegionFallback interface
> for OperatorCoordinators to produce consistent snapshots against the mixed
> view.
>
> 3. Checkpoint Store — Track ref_checkpoint_id in metadata to prevent
> premature cleanup of referenced historical checkpoints.
>
> The detailed design is described in the FLIP document:
>
> https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing
>
> Looking forward to your feedback!
>
> Best regards,
>
> Raorao Xiong
>
>
>
