[ 
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feifan Wang updated FLINK-24149:
--------------------------------
    Summary: Make checkpoint self-contained and relocatable  (was: Make 
checkpoint relocatable)

> Make checkpoint self-contained and relocatable
> ----------------------------------------------
>
>                 Key: FLINK-24149
>                 URL: https://issues.apache.org/jira/browse/FLINK-24149
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Feifan Wang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2021-09-08-17-06-31-560.png, 
> image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png, 
> image-2021-09-08-18-01-03-176.png
>
>
> h3. 1. Background
> FLINK-5763 proposed making savepoints relocatable, and checkpoints have a 
> similar requirement. For example, a job can be migrated to another HDFS 
> cluster through a savepoint, but we prefer to use persistent checkpoints, 
> especially because RocksDBStateBackend incremental checkpoints perform 
> better than savepoints during snapshot and restore.
>  
> FLINK-8531 standardized the directory layout:
> {code:java}
> /user-defined-checkpoint-dir
>     |
>     + 1b080b6e710aabbef8993ab18c6de98b (job's ID)
>         |
>         + --shared/
>         + --taskowned/
>         + --chk-00001/
>         + --chk-00002/
>         + --chk-00003/
>         ...
> {code}
>  * The state backend creates a subdirectory named after the job's ID that 
> contains the actual checkpoints, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/
>  * Each checkpoint stores its own files in a subdirectory whose name 
> includes the checkpoint number, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/chk-00003/
>  * Files shared between checkpoints are stored in the shared/ directory, 
> which sits next to the individual checkpoint directories, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/shared/
>  * Similarly, files owned strictly by tasks are stored in the taskowned/ 
> directory, which also sits next to the individual checkpoint directories, 
> such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/taskowned/
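
The layout described above can be sketched as a small path helper. This is only an illustration: `CheckpointLayoutSketch` and its methods are hypothetical names, not Flink classes, and the five-digit zero padding simply mirrors the listing above (real directory names may differ).

```java
/** Hypothetical helper that mirrors the directory layout listed above; not part of Flink. */
final class CheckpointLayoutSketch {

    private CheckpointLayoutSketch() {}

    /** Per-job directory: <base>/<jobId> */
    static String jobDir(String baseDir, String jobId) {
        return stripTrailingSlash(baseDir) + "/" + jobId;
    }

    /** Directory for files shared between checkpoints: <base>/<jobId>/shared */
    static String sharedDir(String baseDir, String jobId) {
        return jobDir(baseDir, jobId) + "/shared";
    }

    /** Directory for files owned strictly by tasks: <base>/<jobId>/taskowned */
    static String taskOwnedDir(String baseDir, String jobId) {
        return jobDir(baseDir, jobId) + "/taskowned";
    }

    /** Exclusive directory of one checkpoint: <base>/<jobId>/chk-<number> */
    static String checkpointDir(String baseDir, String jobId, long checkpointNumber) {
        // Assumption: zero padding to five digits, as shown in the listing above.
        return jobDir(baseDir, jobId) + "/chk-" + String.format("%05d", checkpointNumber);
    }

    private static String stripTrailingSlash(String dir) {
        return dir.endsWith("/") ? dir.substring(0, dir.length() - 1) : dir;
    }
}
```

All four directories share the same per-job parent, which is why that parent is the natural unit to relocate.
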
> h3. 2. Proposal
> Since an individual checkpoint directory does not contain the complete 
> state data, we cannot make it relocatable on its own, but its parent 
> directory can be. The only work left is to make the metadata file reference 
> state files by relative paths.
> I propose making these changes to _*FsCheckpointStateOutputStream*_:
>  * introduce a _*checkpointDirectory*_ field and remove the 
> *_allowRelativePaths_* field
>  * introduce an *_entropyInjecting_* field
>  * have *_closeAndGetHandle()_* return a _*RelativeFileStateHandle*_ whose 
> path is relative to _*checkpointDirectory*_ (except on entropy-injecting 
> file systems)
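
The relative-path idea can be illustrated with plain `java.net.URI` calls. This is a minimal sketch, not the code in the pull request: `RelativePathSketch` is a hypothetical class, the base directory is assumed to be the per-job parent directory, the cluster names are made up, and the entropy-injection exception is not modeled.

```java
import java.net.URI;

/** Hypothetical sketch of storing and resolving relative state file paths; not Flink code. */
final class RelativePathSketch {

    private RelativePathSketch() {}

    /**
     * Path of {@code file} relative to {@code baseDir}. If the file is not located
     * under the base directory, URI.relativize returns the file URI unchanged.
     */
    static String relativize(URI baseDir, URI file) {
        return ensureTrailingSlash(baseDir).relativize(file).toString();
    }

    /** Resolves a stored relative path against the (possibly relocated) base directory. */
    static URI resolve(URI newBaseDir, String relativePath) {
        return ensureTrailingSlash(newBaseDir).resolve(relativePath);
    }

    // URI.resolve treats a base without a trailing slash as a file, so normalize it first.
    private static URI ensureTrailingSlash(URI dir) {
        String s = dir.toString();
        return s.endsWith("/") ? dir : URI.create(s + "/");
    }

    public static void main(String[] args) {
        URI oldBase = URI.create("hdfs://cluster-a/checkpoints/job1");
        URI file = URI.create("hdfs://cluster-a/checkpoints/job1/shared/f1");

        // What the metadata file would record instead of the absolute path.
        String relative = relativize(oldBase, file);

        // After copying the whole job directory to another cluster.
        URI newBase = URI.create("hdfs://cluster-b/backup/job1");
        System.out.println(relative + " -> " + resolve(newBase, relative));
    }
}
```

Because the metadata only records the path below the base directory, copying the whole directory elsewhere keeps every reference valid once the new base is known at restore time.
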
> [~yunta], [~trohrmann], I verified this in our environment and submitted a 
> pull request that implements this feature. Please help evaluate whether it 
> is appropriate.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
