Re: [DISCUSS] CIP-21: Support Flink job recovery from JobManager failure for Apache Celeborn

Nicholas Jiang Mon, 28 Jul 2025 02:04:01 -0700

Hi Xu Huang,

Thanks for driving CIP-21 to support Flink job recovery from JobManager (JM)
failures. I have some questions about this proposal:

1. The OperationLog interface is introduced to represent various operations 
within Celeborn. Could we reuse the JobEvent interface instead of introducing 
a new OperationLog interface?

2. RecoverableStore is an interface that supports asynchronous writing and 
sequential reading of OperationLog to reliable storage. Could we reuse the 
JobEventStore introduced in FLIP-383? IMO, there is no need to introduce 
RecoverableStore in Celeborn.

3. Several interfaces are introduced in FLIP-383, including 
ShuffleMaster#getAllPartitionWithMetrics. Should RemoteShuffleMaster implement 
them to support job recovery? The proposed changes do not mention this point.

4. How does this proposal guarantee state compatibility during the upgrade 
process?
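
To make questions 1 and 2 concrete, here is a rough Java sketch of the contract
as I read it from the proposal text: writes are asynchronous, and reads replay
the logs in the order they were written. Every name below (PartitionRegistered,
InMemoryRecoverableStore, writeAsync, readAll, describe) is a placeholder I
made up for illustration, the OperationLog interface shown is not the one in
CIP-21, and an in-memory list stands in for reliable storage. It is not meant
to reflect the actual signatures in CIP-21 or FLIP-383.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an operation record; not the CIP-21 interface.
interface OperationLog {
    String describe();
}

// Hypothetical example of one kind of operation.
final class PartitionRegistered implements OperationLog {
    private final String partitionId;

    PartitionRegistered(String partitionId) {
        this.partitionId = partitionId;
    }

    @Override
    public String describe() {
        return "register " + partitionId;
    }
}

// Hypothetical store: an in-memory list replaces reliable storage.
final class InMemoryRecoverableStore implements AutoCloseable {
    private final List<OperationLog> logs = new ArrayList<>();
    // A single writer thread keeps the stored order equal to submission order.
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    // Asynchronous write: returns immediately, completes once the log is stored.
    CompletableFuture<Void> writeAsync(OperationLog log) {
        return CompletableFuture.runAsync(() -> {
            synchronized (logs) {
                logs.add(log);
            }
        }, writer);
    }

    // Sequential read: returns a snapshot of the logs in write order.
    List<OperationLog> readAll() {
        synchronized (logs) {
            return new ArrayList<>(logs);
        }
    }

    @Override
    public void close() {
        writer.shutdown();
    }
}

class RecoverableStoreDemo {
    public static void main(String[] args) {
        try (InMemoryRecoverableStore store = new InMemoryRecoverableStore()) {
            CompletableFuture<Void> first = store.writeAsync(new PartitionRegistered("p-1"));
            CompletableFuture<Void> second = store.writeAsync(new PartitionRegistered("p-2"));
            // Wait for both writes before replaying, as a recovering JM would.
            CompletableFuture.allOf(first, second).join();
            for (OperationLog log : store.readAll()) {
                System.out.println(log.describe());
            }
        }
    }
}
```

If the contract really is this small, it looks close to what an event store
for job events would already provide, which is why I am asking whether a
separate abstraction is needed.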

Regards,
Nicholas Jiang

On 2025/07/28 07:25:00 Xu Huang wrote:
> Hi community,
> 
> I’d like to initiate a discussion regarding CIP-21: Support Flink job
> recovery from JobManager failure for Apache Celeborn [1].
> 
> This proposal aims to enable Celeborn to support Flink’s batch job recovery
> feature [2]. With this enhancement, Flink batch jobs using Celeborn will be
> able to recover from previously completed stages after a JobManager
> failure, eliminating the need to restart the entire job from scratch.
> 
> Your feedback and questions are welcome — please feel free to share any
> thoughts you may have.
> 
> Best regards,
> Xu Huang
> 
> [1] CIP-21: Support flink jobs recovery from JobManager failure for Apache
> Celeborn. https://cwiki.apache.org/confluence/x/kw9JFg
> [2] FLIP-383: Support Job Recovery from JobMaster Failures for Batch Jobs.
> https://cwiki.apache.org/confluence/x/QwqZE
>
