Hi all,

Thanks everyone for the valuable feedback. I believe all the points raised 
above have been addressed (@Roman @Rui Fan). If there are no further concerns  
by next Monday (June 22), I'll go ahead and start the [VOTE] thread for this 
FLIP. 

For reference, the earlier related discussion can be found here:
https://lists.apache.org/thread/qpztk0jdpcmhomszjx63l53xv26xnmwf


Please feel free to share any additional feedback before then.

Best Regards,
Raorao

> On May 27, 2026, at 16:31, 熊饶饶 <[email protected]> wrote:
> 
> Hi devs,
> 
> I would like to start a discussion on FLIP-XXX: Independent Checkpoint Based 
> On Pipeline Region.
> 
> In high-parallelism streaming jobs, a single Task's checkpoint failure causes 
> the entire global Checkpoint to abort, leading to degraded checkpoint success 
> rates and wasted compute resources (especially for GPU operators).
> 
> We propose Regional Checkpoint: when some Regions fail to checkpoint, the 
> framework combines the historical state of the failed Regions with the 
> current state of the healthy Regions to produce a logically complete 
> Completed Checkpoint — while preserving state consistency. The key changes 
> are:
> 
> 1. Snapshot Collection — Allow partial region failures; combine last 
> successful state of failed Regions with current state of normal Regions.
> 
> 2. State Correction — New checkpointCoordinatorForRegionFallback interface 
> for OperatorCoordinators to produce consistent snapshots against the mixed 
> view.
> 
> 3. Checkpoint Store — Track ref_checkpoint_id in metadata to prevent 
> premature cleanup of referenced historical checkpoints.
> 
> The detailed design is described in the FLIP document: 
> https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing
> 
> Looking forward to your feedback!
> 
> Best regards,
> 
> Raorao Xiong
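
For readers skimming the thread, the snapshot-collection and checkpoint-store points in the quoted proposal can be illustrated with a small sketch. This is not the FLIP's implementation: apart from the ref_checkpoint_id idea taken from the proposal, every class, method, and field name below is hypothetical, and the real design in the FLIP document may differ. The sketch assumes each Region's state is identified by an opaque handle plus the id of the checkpoint that produced it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these types exist in Flink.
final class RegionalCheckpointSketch {

    /** State of one Region, tagged with the checkpoint that produced it. */
    static final class RegionSnapshot {
        final long checkpointId;   // plays the role of ref_checkpoint_id
        final String stateHandle;  // opaque placeholder for the real state handle

        RegionSnapshot(long checkpointId, String stateHandle) {
            this.checkpointId = checkpointId;
            this.stateHandle = stateHandle;
        }
    }

    /**
     * Builds the mixed view for a new checkpoint: Regions that succeeded
     * contribute their fresh snapshot, failed Regions fall back to the
     * snapshot recorded in the previous completed checkpoint.
     */
    static Map<String, RegionSnapshot> merge(
            Set<String> allRegions,
            Map<String, RegionSnapshot> previous,
            Map<String, RegionSnapshot> succeeded) {
        Map<String, RegionSnapshot> merged = new HashMap<>();
        for (String region : allRegions) {
            RegionSnapshot snapshot = succeeded.get(region);
            if (snapshot == null) {
                snapshot = previous.get(region); // fallback for a failed Region
            }
            if (snapshot == null) {
                // No historical state to fall back to: the checkpoint cannot complete.
                throw new IllegalStateException("No state available for region " + region);
            }
            merged.put(region, snapshot);
        }
        return merged;
    }

    /** Checkpoint ids that the merged view still depends on. */
    static Set<Long> referencedCheckpointIds(Map<String, RegionSnapshot> merged) {
        Set<Long> ids = new TreeSet<>();
        for (RegionSnapshot snapshot : merged.values()) {
            ids.add(snapshot.checkpointId);
        }
        return ids;
    }

    /** An older checkpoint may be cleaned up only if the merged view no longer references it. */
    static boolean canDiscard(long checkpointId, Map<String, RegionSnapshot> merged) {
        return !referencedCheckpointIds(merged).contains(checkpointId);
    }
}
```

In this sketch, if Region B fails during checkpoint 42 while Region A succeeds, the merged view holds A's state from 42 and B's state from 41, so checkpoint 41 must be retained until B next succeeds.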
