Re: [DISCUSS] FLIP-432: Faster Checkpoint & Recovery for Disaggregated State

Han Yin Wed, 26 Mar 2025 01:30:17 -0700

Hi Yanfei,
Thanks for your response!
1. Yes. The feature has been evaluated on a Flink job with roughly 200 GB of 
state. It reduces the average checkpoint duration from dozens of seconds to 
within a few seconds, and the recovery duration from several minutes to 
about ten seconds.
2. Yes. As long as the job is restored in CLAIM mode and ForSt’s work dir 
resides within the checkpoint shared dir, the checkpoint files will be reused 
instead of being copied.
3. The only overhead is maintaining file metadata in a hash map, which is 
lightweight and negligible compared to the cost of checkpointing and 
restoration.
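
To make points 2 and 3 concrete, here is a minimal sketch of the reuse 
decision and the metadata map. It is only an illustration, not the actual 
ForSt code: the names Transfer, is_within, choose_restore_transfer, FileMeta 
and FileRegistry are hypothetical, the ownership flag is an assumption about 
what the metadata might hold, and the paths are plain POSIX-style strings 
(real remote URIs would need scheme-aware handling).

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import PurePosixPath


class Transfer(Enum):
    REUSE = "reuse"  # keep the checkpoint file in place and reference it
    COPY = "copy"    # duplicate the file into the working directory


def is_within(path: str, parent: str) -> bool:
    """Return True if `path` equals `parent` or lies somewhere beneath it."""
    p, q = PurePosixPath(path), PurePosixPath(parent)
    return p == q or q in p.parents


def choose_restore_transfer(claim_mode: bool, work_dir: str, shared_dir: str) -> Transfer:
    """Reuse files only when both conditions from point 2 hold."""
    if claim_mode and is_within(work_dir, shared_dir):
        return Transfer.REUSE
    return Transfer.COPY


@dataclass
class FileMeta:
    size_bytes: int
    owned_by_db: bool  # hypothetical flag: False once a checkpoint owns the file


class FileRegistry:
    """Hash map from file path to metadata (point 3): one dict entry per file."""

    def __init__(self) -> None:
        self._files = {}

    def track(self, path: str, size_bytes: int) -> None:
        # A newly created file starts out owned by the DB.
        self._files[path] = FileMeta(size_bytes, owned_by_db=True)

    def hand_over(self, path: str) -> None:
        # Mark the file as owned by a checkpoint; no data is moved.
        self._files[path].owned_by_db = False

    def owned_by_db(self, path: str) -> bool:
        return self._files[path].owned_by_db


if __name__ == "__main__":
    print(choose_restore_transfer(True, "/cp/job/shared/db", "/cp/job/shared"))
    print(choose_restore_transfer(True, "/tmp/forst", "/cp/job/shared"))
```

Each update touches a single dictionary entry, which is why the bookkeeping 
cost stays small next to any actual file transfer.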


> On Mar 19, 2025, at 16:36, Yanfei Lei <[email protected]> wrote:
> 
> Hi Han,
> 
> Thanks for the proposal.
> Faster Checkpoint & Recovery lays the groundwork for Disaggregated
> State to adapt to cloud-native deployment. Regarding the FLIP, I have
> three comments:
> 
> 1. Are there any preliminary evaluation results available for this feature?
> 2. In terms of compatibility, can this feature be enabled using an
> existing original checkpoint or a native savepoint?
> 3. Does this feature introduce any additional overhead?
> 
> On Fri, Feb 21, 2025 at 19:00, Han Yin <[email protected]> wrote:
> 
>> 
>> Hi Zakelly,
>> Thanks for your response!
>> 1. Sure. I’ve added a Section called ‘End-to-end user case’ after the 
>> section ‘Overview’.
>> 2. Yes, because reusing files somewhat goes against the semantics of a full 
>> checkpoint. If a full checkpoint is enforced, the FileTransferStrategy will 
>> force the files to be transferred by copying instead of by reuse.
>> 3. Yes. All the changes are under the ForStStateBackend. I’ve updated 
>> the section in the FLIP.
>> 4. In fact, we don't need much special file handling for checkpoint 
>> failures, as they are managed by ForSt’s snapshot strategy. The proposed 
>> FileTransferStrategy only checks whether the files are successfully 
>> transferred. If the transfer is unsuccessful, it throws an exception, 
>> ultimately failing the checkpoint. If the transfer succeeds but the 
>> checkpoint is aborted, since the file is already 'uploaded' to the 
>> checkpoint directory, it is no longer owned by the DB, and the snapshot 
>> strategy will skip re-uploading it for subsequent checkpoints.
>> 
>>> On Feb 17, 2025, at 11:44, Zakelly Lan <[email protected]> wrote:
>>> 
>>> Hi Han,
>>> 
>>> Thanks for driving this!
>>> 
>>> The FLIP is in good shape. Here are my comments:
>>> 
>>> 1. The FLIP introduces file reuse during snapshot and recovery. Could
>>> you please provide some common use cases from the user's perspective, e.g.
>>> periodic checkpoint, native savepoint?
>>> 2. Does the current design depend on incremental checkpoints? If we
>>> enforce a full checkpoint, what happens?
>>> 3. Will all the proposed changes be under the ForStStateBackend? It would
>>> be better to emphasize this in 'Proposed Changes'.
>>> 4. Is there any special file handling for checkpoint failure?
>>> 
>>> 
>>> Best,
>>> Zakelly
>>> 
>>> 
>>> On Fri, Feb 14, 2025 at 6:35 PM Han Yin <[email protected]> wrote:
>>> 
>>>> Hi everyone,
>>>> 
>>>> I would like to open a discussion on implementing faster checkpoint &
>>>> recovery for disaggregated state[1].
>>>> 
>>>> This work improves ForSt, the disaggregated state backend, so you may
>>>> want to read FLIP-423[2] and FLIP-428[3] for background.
>>>> 
>>>> Currently, ForSt copies or fast-duplicates files between the working
>>>> directory and the checkpoint directory during checkpointing and
>>>> restoration. However, in a disaggregated environment, there is no need to
>>>> maintain multiple copies of files since they typically reside within the
>>>> same remote file system. Therefore, we propose an approach for reusing
>>>> files when ForSt generates snapshots or restores from checkpoints and for
>>>> managing file ownership between Flink and ForSt. By eliminating the
>>>> overhead of file copying, checkpointing, restoration, and rescaling can
>>>> become significantly faster for disaggregated state.
>>>> 
>>>> Looking forward to your comments or feedback.
>>>> 
>>>> Best regards,
>>>> Han Yin
>>>> 
>>>> [1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046898
>>>> [2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855
>>>> [3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046865
>> 
> 
> 
> --
> Best,
> Yanfei
