Hi all,

Thanks everyone for the valuable feedback. I believe all the points raised above have been addressed (@Roman @Rui Fan). If there are no further concerns by next Monday (June 22), I'll go ahead and start the [VOTE] thread for this FLIP.
For reference, the earlier related discussion can be found here:
https://lists.apache.org/thread/qpztk0jdpcmhomszjx63l53xv26xnmwf

Please feel free to share any additional feedback before then.

Best Regards,
Raorao

> On May 27, 2026, at 16:31, 熊饶饶 <[email protected]> wrote:
>
> Hi devs,
>
> I would like to start a discussion on FLIP-XXX: Independent Checkpoint Based
> On Pipeline Region.
>
> In high-parallelism streaming jobs, a single Task's checkpoint failure causes
> the entire global Checkpoint to abort, leading to degraded checkpoint success
> rates and wasted compute resources (especially for GPU operators).
>
> We propose Regional Checkpoint: when some Regions fail to checkpoint, the
> framework combines the historical state of the failed Regions with the
> current state of the healthy Regions to produce a logically complete
> Completed Checkpoint, while preserving state consistency. The key changes
> are:
>
> 1. Snapshot Collection: Allow partial Region failures; combine the last
> successful state of failed Regions with the current state of healthy Regions.
>
> 2. State Correction: New checkpointCoordinatorForRegionFallback interface
> for OperatorCoordinators to produce consistent snapshots against the mixed
> view.
>
> 3. Checkpoint Store: Track ref_checkpoint_id in metadata to prevent
> premature cleanup of referenced historical checkpoints.
>
> The detailed design is described in the FLIP document:
> https://docs.google.com/document/d/153r9NjHN9xgFUBdZ8sNX6YjUWTREtDMv5i-JaMdE6NU/edit?usp=sharing
>
> Looking forward to your feedback!
>
> Best regards,
>
> Raorao Xiong
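
To make points 1 and 3 of the quoted proposal concrete, here is a minimal, self-contained Java sketch of how I read them. It is not code from the FLIP or from Flink: every class and method name (RegionalCheckpointSketch, RegionState, MergedCheckpoint, merge, canDiscard) is hypothetical, state handles are reduced to plain strings, and the State Correction step (point 2) is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative only: none of these types exist in Flink. */
final class RegionalCheckpointSketch {

    /** A Region's state handle plus the id of the checkpoint that produced it. */
    static final class RegionState {
        final String handle;
        final long producedByCheckpointId;

        RegionState(String handle, long producedByCheckpointId) {
            this.handle = handle;
            this.producedByCheckpointId = producedByCheckpointId;
        }
    }

    /** A logically complete checkpoint built from current and historical Region state. */
    static final class MergedCheckpoint {
        final long checkpointId;
        final Map<String, RegionState> regions;
        /** Older checkpoints whose state is reused here (the ref_checkpoint_id idea). */
        final Set<Long> refCheckpointIds;

        MergedCheckpoint(long checkpointId, Map<String, RegionState> regions, Set<Long> refs) {
            this.checkpointId = checkpointId;
            this.regions = regions;
            this.refCheckpointIds = refs;
        }
    }

    /**
     * Snapshot Collection: Regions that snapshotted successfully contribute their
     * fresh handle; failed Regions fall back to their last successful state.
     */
    static MergedCheckpoint merge(
            long checkpointId,
            Set<String> allRegions,
            Map<String, String> freshHandles,
            Map<String, RegionState> lastSuccessful) {
        Map<String, RegionState> merged = new HashMap<>();
        Set<Long> refs = new TreeSet<>();
        for (String region : allRegions) {
            String fresh = freshHandles.get(region);
            if (fresh != null) {
                merged.put(region, new RegionState(fresh, checkpointId));
                continue;
            }
            RegionState previous = lastSuccessful.get(region);
            if (previous == null) {
                // Assumption: without any earlier state the checkpoint cannot complete.
                throw new IllegalStateException("Region " + region + " has no earlier state");
            }
            merged.put(region, previous);
            refs.add(previous.producedByCheckpointId);
        }
        return new MergedCheckpoint(checkpointId, merged, refs);
    }

    /** Checkpoint Store: an old checkpoint is kept while a retained one references it. */
    static boolean canDiscard(long candidateId, Iterable<MergedCheckpoint> retained) {
        for (MergedCheckpoint cp : retained) {
            if (cp.refCheckpointIds.contains(candidateId)) {
                return false;
            }
        }
        return true;
    }
}
```

For example, if Region A succeeds in checkpoint 7 while Region B fails and last succeeded in checkpoint 5, merge(...) records 5 as a referenced checkpoint, and canDiscard(5, ...) returns false for as long as checkpoint 7 is retained.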
