[DISCUSS] FLIP-432: Faster Checkpoint & Recovery for Disaggregated State

Han Yin Fri, 14 Feb 2025 02:35:46 -0800

Hi everyone,

I would like to open a discussion on implementing faster checkpoint & recovery 
for disaggregated state[1].

This work improves ForSt, the disaggregated state management component, so 
you may want to read FLIP-423[2] and FLIP-428[3] for background.

Currently, ForSt copies or fast-duplicates files between the working directory 
and the checkpoint directory during checkpointing and restoration. However, in 
a disaggregated environment, there is no need to maintain multiple copies of 
files since they typically reside within the same remote file system. 
Therefore, we propose reusing files when ForSt generates snapshots or restores 
from checkpoints, together with a way of managing file ownership between 
Flink and ForSt. By eliminating the overhead of file copying, checkpointing, 
restoration, and rescaling can become significantly faster for disaggregated 
state.
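
As a rough illustration of the difference between copying and reusing files, 
here is a minimal Python sketch. It is not taken from the FLIP: RemoteFs, 
snapshot_by_copy, snapshot_by_reuse, and the "forst"/"flink" owner labels are 
all hypothetical names, and the ownership rule (a file referenced by a 
checkpoint is marked as owned by Flink) is an assumption made only for this 
example.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteFs:
    """Toy stand-in for a shared remote file system (path -> contents)."""
    files: dict = field(default_factory=dict)
    bytes_copied: int = 0

    def copy(self, src: str, dst: str) -> None:
        # Duplicating a file costs I/O proportional to its size.
        self.files[dst] = self.files[src]
        self.bytes_copied += len(self.files[src])


def snapshot_by_copy(fs: RemoteFs, working_files: list, checkpoint_dir: str) -> list:
    """Copy-based snapshot: duplicate every working file into the checkpoint directory."""
    handles = []
    for path in working_files:
        dst = f"{checkpoint_dir}/{path.rsplit('/', 1)[-1]}"
        fs.copy(path, dst)
        handles.append(dst)
    return handles


def snapshot_by_reuse(owners: dict, working_files: list) -> list:
    """Reuse-based snapshot: reference files in place and record who owns them.

    Assumption for this sketch: once a checkpoint references a file,
    ownership moves from "forst" to "flink", so ForSt must not delete it.
    """
    for path in working_files:
        owners[path] = "flink"
    return list(working_files)


fs = RemoteFs({"work/000001.sst": b"a" * 1024, "work/000002.sst": b"b" * 2048})
working = sorted(fs.files)

copied_handles = snapshot_by_copy(fs, working, "chk-1")
print(fs.bytes_copied)  # 3072 bytes duplicated by the copy-based snapshot

owners = {path: "forst" for path in working}
reused_handles = snapshot_by_reuse(owners, working)
print(reused_handles == working)  # True: the checkpoint points at the same files
```

The reuse path touches only metadata, so its cost does not grow with state 
size. The price is the ownership bookkeeping, since a file can no longer be 
deleted just because the working directory stops using it.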

Looking forward to your comments or feedback.

Best regards,
Han Yin

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046898
[2] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855
[3] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046865
