[jira] [Commented] (FLINK-24149) Make checkpoint relocatable

Piotr Nowojski (Jira) Mon, 06 Sep 2021 23:52:06 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17410947#comment-17410947
 ]


Piotr Nowojski commented on FLINK-24149:
----------------------------------------

Can we take a step back? [~Feifan Wang], could you explain in a bit more 
detail what you want to achieve? You want to migrate the job to another HDFS 
cluster, and while you can do that with a savepoint, savepoints are too 
expensive for you? And what exactly are you proposing in order to do that? I 
don't fully understand your proposal or what you are trying to achieve, so 
could you rephrase or elaborate on it? Maybe write down, step by step, what 
currently happens and what would happen under your proposal?

Secondly, there are some ongoing discussions/proposals around this topic, for 
example [FLIP-47 Checkpoints vs 
Savepoints|https://cwiki.apache.org/confluence/display/FLINK/FLIP-47%3A+Checkpoints+vs.+Savepoints].
 We are currently updating this proposal to take some other requirements into 
account, but one thing we want to change is to more or less remove the current 
distinction between savepoints and checkpoints; for example, we want to make 
any snapshot incremental. The new distinction between checkpoints and 
savepoints would be who owns the files that are part of the snapshot. For a 
checkpoint, Flink would own them and would be responsible for cleaning them 
up. For a savepoint, the user would be. One possible solution is that any 
checkpoint could be turned into a savepoint by just copying out the files. 
Wouldn't something like that solve your problem? You would just be able to 
take an incremental checkpoint, copy it out to another HDFS cluster, and 
restart the job from it?

> Make checkpoint relocatable
> ---------------------------
>
>                 Key: FLINK-24149
>                 URL: https://issues.apache.org/jira/browse/FLINK-24149
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>            Reporter: Feifan Wang
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
> FLINK-5763 proposed making savepoints relocatable; checkpoints have a similar 
> requirement. For example, migrating jobs to other HDFS clusters can be 
> achieved through a savepoint, but we prefer to use persistent checkpoints, 
> especially because RocksDBStateBackend incremental checkpoints perform better 
> than savepoints during snapshot and restore.
>  
> FLINK-8531 standardized the directory layout:
> {code:java}
> /user-defined-checkpoint-dir
>     |
>     + 1b080b6e710aabbef8993ab18c6de98b (job's ID)
>         |
>         + --shared/
>         + --taskowned/
>         + --chk-00001/
>         + --chk-00002/
>         + --chk-00003/
>         ...
> {code}
>  * State backend will create a subdirectory with the job's ID that will 
> contain the actual checkpoints, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/
>  * Each checkpoint individually will store all its files in a subdirectory 
> that includes the checkpoint number, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/chk-00003/
>  * Files shared between checkpoints will be stored in the shared/ directory 
> in the same parent directory as the separate checkpoint directory, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/shared/
>  * Similar to shared files, files owned strictly by tasks will be stored in 
> the taskowned/ directory in the same parent directory as the separate 
> checkpoint directory, such as: 
> user-defined-checkpoint-dir/1b080b6e710aabbef8993ab18c6de98b/taskowned/
> h3. Proposal
> Since an individual checkpoint directory does not contain the complete state 
> data, we cannot make it relocatable, but its parent directory can be. The 
> only work left is to make the metadata file reference relative file paths.
> I propose making these changes to _*FsCheckpointStateOutputStream*_:
>  * introduce a _*checkpointDirectory*_ field and remove the 
> *_allowRelativePaths_* field
>  * introduce an *_entropyInjecting_* field
>  * *_closeAndGetHandle()_* returns a _*RelativeFileStateHandle*_ with a path 
> relative to _*checkpointDirectory*_ (except on entropy-injecting file systems)
> [~yunta], [~trohrmann], I verified this in our environment and submitted a 
> pull request implementing this feature. Please help evaluate whether it is 
> appropriate.
>  
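
The relative-path idea in the quoted proposal can be illustrated with a small, standalone sketch. It is not Flink code and does not use FsCheckpointStateOutputStream or RelativeFileStateHandle; the class RelativeCheckpointPaths, its methods, and the cluster-a/cluster-b URIs are hypothetical names chosen for illustration. It only shows how a file reference stored relative to the job's checkpoint directory could be re-resolved after that directory is copied to another location.

```java
import java.net.URI;

// Hypothetical helper, not part of Flink: stores state-file references
// relative to the job's checkpoint directory so the directory can be moved.
class RelativeCheckpointPaths {

    // Make sure the directory URI ends with '/', so that relativize() and
    // resolve() treat it as a directory rather than as a file.
    static URI asDirectory(URI dir) {
        String s = dir.toString();
        return s.endsWith("/") ? dir : URI.create(s + "/");
    }

    // Convert an absolute state-file URI into a path relative to checkpointDir.
    static String toRelative(URI checkpointDir, URI stateFile) {
        URI relative = asDirectory(checkpointDir).relativize(stateFile);
        // relativize() returns its argument unchanged when the file is not
        // located under the given directory.
        if (relative.equals(stateFile)) {
            throw new IllegalArgumentException(
                    stateFile + " is not under " + checkpointDir);
        }
        return relative.toString();
    }

    // Rebuild an absolute URI from a stored relative path and a (possibly new)
    // checkpoint directory.
    static URI resolveAgainst(URI newCheckpointDir, String relativePath) {
        return asDirectory(newCheckpointDir).resolve(relativePath);
    }

    public static void main(String[] args) {
        // Assumed example locations; the job ID is the one from the layout above.
        URI oldDir = URI.create(
                "hdfs://cluster-a/checkpoints/1b080b6e710aabbef8993ab18c6de98b");
        URI newDir = URI.create(
                "hdfs://cluster-b/flink/1b080b6e710aabbef8993ab18c6de98b");

        String relative = toRelative(oldDir, URI.create(oldDir + "/shared/abc.sst"));
        System.out.println(relative);
        System.out.println(resolveAgainst(newDir, relative));
    }
}
```

Relativizing against the job directory rather than a single chk-N directory matches the observation in the proposal that files under shared/ and taskowned/ live next to the individual checkpoint directories.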



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
