[ 
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410141#comment-17410141
 ] 

Feifan Wang edited comment on FLINK-24149 at 9/5/21, 3:29 PM:
--------------------------------------------------------------

Hi [~yunta], as far as I currently understand checkpointing, I think this 
feature can support the three scenarios you mentioned.

The main change is that FsCheckpointStreamFactory returns 
RelativeFileStateHandle instead of FileStateHandle. The difference between the 
two is that RelativeFileStateHandle holds one more field, which stores the 
file's path relative to the checkpoint's exclusive directory.

Except when entropy injection is used, all files written by a checkpoint share 
the same path prefix ( /user-defined-checkpoint-dir/${job_id} ). Therefore, we 
can always derive a path relative to the checkpoint's exclusive directory, like 
the following:
 * ./checkpoint-file-001   (files in exclusive directory)
 * ../shared/checkpoint-file-002   (files in shared directory)
 * ../taskowned/checkpoint-file-003   (files in taskowned directory)
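
The relative paths listed above can be illustrated with a minimal sketch. It uses only java.nio.file from the JDK, not Flink's own Path or state handle classes; the class name RelativePathSketch, its two helper methods, and the /new-cluster-dir prefix are hypothetical and exist only for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Flink: shows how a path relative to a
// checkpoint's exclusive directory can be computed and later resolved.
final class RelativePathSketch {

    /** Computes the path of a checkpoint file relative to the exclusive directory. */
    static String toRelative(String exclusiveDir, String file) {
        Path base = Paths.get(exclusiveDir).normalize();
        Path target = Paths.get(file).normalize();
        // Use forward slashes regardless of the platform's separator.
        return base.relativize(target).toString().replace('\\', '/');
    }

    /** Resolves a stored relative path against a (possibly relocated) exclusive directory. */
    static String resolve(String exclusiveDir, String relative) {
        Path resolved = Paths.get(exclusiveDir).resolve(relative).normalize();
        return resolved.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        String jobDir = "/user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b";
        String exclusiveDir = jobDir + "/chk-00003";

        // Files in the exclusive, shared, and taskowned directories.
        System.out.println(toRelative(exclusiveDir, exclusiveDir + "/checkpoint-file-001"));
        System.out.println(toRelative(exclusiveDir, jobDir + "/shared/checkpoint-file-002"));
        System.out.println(toRelative(exclusiveDir, jobDir + "/taskowned/checkpoint-file-003"));

        // After the whole job directory has been moved, the same relative path
        // resolves against the new location of the exclusive directory.
        String movedExclusiveDir = "/new-cluster-dir/1b080b6e710aabbef8993ab18c6de98b/chk-00003";
        System.out.println(resolve(movedExclusiveDir, "../shared/checkpoint-file-002"));
    }
}
```

Note that java.nio.file produces checkpoint-file-001 without the leading ./ for a file in the exclusive directory; the two forms refer to the same file.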

 


was (Author: feifan wang):
Hi [~yunta], as far as I currently understand checkpointing, I think this 
feature can support the three scenarios you mentioned.

The main change is that FsCheckpointStreamFactory returns 
RelativeFileStateHandle instead of FileStateHandle. The difference between the 
two is that RelativeFileStateHandle holds one more field, which stores the 
file's path relative to the checkpoint's exclusive directory.

Except when entropy injection is used, all files written by a checkpoint share 
the same path prefix ( /user-defined-checkpoint-dir/${job_id} ). Therefore, we 
can always derive a path relative to the checkpoint's exclusive directory, like 
the following:
 * ./checkpoint-file-001   (files in exclusive directory)
 * ../shared/checkpoint-file-002   (files in shared directory)
 * ../taskowned/checkpoint-file-003   (files in shared directory)

 

> Make checkpoint relocatable
> ---------------------------
>
>                 Key: FLINK-24149
>                 URL: https://issues.apache.org/jira/browse/FLINK-24149
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Feifan Wang
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
> FLINK-5763 proposed making savepoints relocatable, and checkpoints have 
> similar requirements. For example, when migrating jobs to other HDFS 
> clusters, this can be achieved through a savepoint, but we prefer to use 
> persistent checkpoints, especially since RocksDBStateBackend incremental 
> checkpoints perform better than savepoints during snapshot and restore.
>  
> FLINK-8531 standardized the directory layout:
> {code:java}
> /user-defined-checkpoint-dir
>     |
>     + 1b080b6e710aabbef8993ab18c6de98b (job's ID)
>         |
>         + --shared/
>         + --taskowned/
>         + --chk-00001/
>         + --chk-00002/
>         + --chk-00003/
>         ...
> {code}
>  * State backend will create a subdirectory with the job's ID that will 
> contain the actual checkpoints, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/
>  * Each checkpoint individually will store all its files in a subdirectory 
> that includes the checkpoint number, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/chk-00003/
>  * Files shared between checkpoints will be stored in the shared/ directory 
> in the same parent directory as the separate checkpoint directory, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/shared/
>  * Similar to shared files, files owned strictly by tasks will be stored in 
> the taskowned/ directory in the same parent directory as the separate 
> checkpoint directory, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/taskowned/
> h3. Proposal
> Since an individual checkpoint directory does not contain the complete state 
> data, we cannot make it relocatable, but its parent directory can be. The only 
> work left is to make the metadata file reference relative file paths.
> I propose making these changes to _*FsCheckpointStateOutputStream*_ :
>  * introduce a _*checkpointDirectory*_ field, and remove the 
> *_allowRelativePaths_* field
>  * introduce an *_entropyInjecting_* field
>  * *_closeAndGetHandle()_* returns a _*RelativeFileStateHandle*_ with a 
> relative path based on _*checkpointDirectory*_ (except for entropy-injecting 
> file systems)
> [~yunta], [~trohrmann], I verified this in our environment and submitted a 
> pull request to implement this feature. Please help evaluate whether it is 
> appropriate.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
