Hi Xu Huang,

Thanks for driving CIP-21 to support Flink job recovery from JobManager (JM) failure. I have some questions about this proposal:

1. The OperationLog interface is introduced to represent various operations within Celeborn. Could we reuse the JobEvent interface instead of introducing a new OperationLog interface?

2. RecoverableStore is an interface that supports asynchronous writing and sequential reading of OperationLog entries to and from reliable storage. Could we reuse the JobEventStore introduced in FLIP-383? IMO, there is no need to introduce RecoverableStore in Celeborn.

3. FLIP-383 introduces several interfaces, including ShuffleMaster#getAllPartitionWithMetrics. Should RemoteShuffleMaster implement these newly introduced interfaces to support job recovery? The proposed changes do not mention this point.

4. How does this proposal guarantee status compatibility during the upgrade process?

Regards,
Nicholas Jiang

On 2025/07/28 07:25:00 Xu Huang wrote:
> Hi community,
>
> I’d like to initiate a discussion regarding CIP-21: Support Flink job
> recovery from JobManager failure for Apache Celeborn [1].
>
> This proposal aims to enable Celeborn to support Flink’s batch job recovery
> feature [2]. With this enhancement, Flink batch jobs using Celeborn will be
> able to recover from previously completed stages after a JobManager
> failure, eliminating the need to restart the entire job from scratch.
>
> Your feedback and questions are welcome; please feel free to share any
> thoughts you may have.
>
> Best regards,
> Xu Huang
>
> [1] CIP-21: Support flink jobs recovery from JobManager failure for Apache
> Celeborn. https://cwiki.apache.org/confluence/x/kw9JFg
> [2] FLIP-383: Support Job Recovery from JobMaster Failures for Batch Jobs.
> https://cwiki.apache.org/confluence/x/QwqZE