[
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17455237#comment-17455237
]
Piotr Nowojski commented on FLINK-24149:
----------------------------------------
[~Feifan Wang] I've noticed that you opened a PR for this feature
([~dwysakowicz] has already written down why we think that your PR is
incorrect). Here I would like to re-open the discussion on whether this issue
is still valid, in the context of
[FLIP-193|https://cwiki.apache.org/confluence/display/FLINK/FLIP-193%3A+Snapshots+ownership]
and its planned follow-up (also targeted for 1.15: incremental native format
savepoints).
1. I think incremental native format savepoints will help address your
background problem. As long as your filesystem supports cheap duplication of
the artefacts, such savepoints will be just as quick as incremental
checkpoints.
2. "Problem 1 : retained incremental checkpoint difficult to clean up once they
used for recovery" is going to be addressed directly by FLIP-193, with its
{{claim}} and {{no-claim}} recovery modes.
3. As such, "Problem 2 : checkpoint not relocatable" would have little value.
It would only make sense in scenarios where the user is not able to complete a
savepoint, for example during some disaster recovery, when the user has to
manually move checkpoint files to a new cluster.
However, as we pointed out before, relocatable checkpoints are only easily
doable for self-contained checkpoints (non-incremental ones). With incremental
checkpoints, it's tricky to handle relative directories. First of all (as
visible in the test failures in your PR), we would need to re-relativize file
paths with respect to the new {{_metadata}} file. But even then, that's quite
fishy. How would the user know which files to relocate if they are spread among
multiple directories? As long as your checkpoint references only previous files
from the same job, I could imagine the user relocating the directory containing
all of the checkpoints, but that's only part of the story. Referenced files can
be in a completely different root directory, or even on a completely different
file system, so in the general case it would be quite difficult to achieve.
All in all, I would strongly suggest dropping this feature request for the
foreseeable future, since incremental savepoints should solve most of the use
cases here.
If that's not enough, I can imagine making self-contained (non-incremental)
checkpoints relocatable. It would be quite easy to understand and explain to
users, and it has some value, as I mentioned before (disaster recovery when the
user is unable to restart a Flink job in the same cluster without first
relocating checkpoints). But this is not as simple as your current proposal,
and it wouldn't solve your problem (as you are using incremental checkpoints).
I think supporting relocation of incremental checkpoints is difficult if we
want to support all cases (including checkpoints that are stored on different
file systems). On top of that, I don't see how the user would know how to even
relocate a checkpoint that is spread among many directories/file systems. How
should they know which files to copy? If we limit ourselves to files in the
same root checkpoints directory, it becomes a bit inconsistent: how would the
user know whether a particular checkpoint is relocatable or not? So I would
also be against doing that. It would be a dangerous, limited-value feature
that's not trivial to implement.
> Make checkpoint self-contained and relocatable
> ----------------------------------------------
>
> Key: FLINK-24149
> URL: https://issues.apache.org/jira/browse/FLINK-24149
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Feifan Wang
> Priority: Major
> Labels: pull-request-available, stale-major
> Attachments: image-2021-09-08-17-06-31-560.png,
> image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png,
> image-2021-09-08-18-01-03-176.png, image-2021-09-14-14-22-31-537.png
>
>
> h1. Background
> We have many jobs with large state size in production environment. According
> to the operation practice of these jobs and the analysis of some specific
> problems, we believe that RocksDBStateBackend's incremental checkpoint has
> many advantages over savepoint:
> # Savepoints take much longer than incremental checkpoints in jobs with
> large state. The figure below shows a job in our production environment: it
> takes nearly 7 minutes to complete a savepoint, while a checkpoint only
> takes a few seconds. (The checkpoint after a savepoint taking longer is a
> problem described in -FLINK-23949-.)
> !image-2021-09-08-17-55-46-898.png|width=723,height=161!
> # Savepoints cause excessive CPU usage. The figure below shows the CPU
> usage of the same job as in the figure above:
> !image-2021-09-08-18-01-03-176.png|width=516,height=148!
> # Savepoints may cause excessive native memory usage and eventually cause
> the TaskManager process memory usage to exceed its limit. (We did not
> further investigate the cause and did not try to reproduce the problem on
> other large-state jobs, but only increased the overhead memory, so this
> reason may not be conclusive.)
> For the above reasons, we tend to use retained incremental checkpoints to
> completely replace savepoints for jobs with large state size.
> h1. Problems
> * *Problem 1 : retained incremental checkpoint difficult to clean up once
> they used for recovery*
> This problem is caused by the fact that jobs recovered from a retained
> incremental checkpoint may reference files in that retained checkpoint's
> shared directory in subsequent checkpoints, even though they are not the
> same job instance. The worst case is that retained checkpoints are
> referenced one by one, forming a very long reference chain. This makes it
> difficult for users to manage retained checkpoints. In fact, we have also
> suffered failures caused by incorrect deletion of retained checkpoints.
> Although we can use the file handles in checkpoint metadata to figure out
> which files can be deleted, I think it is inappropriate to make users do
> this.
> * *Problem 2 : checkpoint not relocatable*
> Even if we can figure out all the files referenced by a checkpoint, moving
> those files will invalidate the checkpoint as well, because the metadata
> file references absolute file paths.
> Since savepoints are already self-contained and relocatable (FLINK-5763),
> why don't we just use savepoints to migrate jobs to another place? In
> addition to the savepoint performance problems described in the background,
> a very important reason is that the migration requirement may come from a
> failure of the original cluster. In that case, there is no opportunity to
> trigger a savepoint.
> h1. Proposal
> * *job's checkpoint directory (user-defined-checkpoint-dir/<jobId>) contains
> all their state files (self-contained)*
> As far as I know, currently only the subsequent checkpoints of jobs
> restored from a retained checkpoint violate this constraint. One possible
> solution is to re-upload all shared files at the first incremental
> checkpoint after the job starts, but we need to discuss how to distinguish
> between a new job instance and a restart.
> * *use relative file path in checkpoint metadata (relocatable)*
> Change all file references in checkpoint metadata to paths relative to the
> _metadata file, so that we can copy user-defined-checkpoint-dir/<jobId> to
> any other place.
>
> BTW, this issue is very similar to FLINK-5763, which can be read as a
> background supplement.
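The relative-path proposal quoted above can be sketched roughly as follows (again using only {{java.nio.file}}; the names are hypothetical and this is not Flink's real metadata format or API): if {{_metadata}} stored references relative to itself, restore could resolve them against wherever the copied checkpoint directory now lives.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch of the "relocatable" proposal: resolve a relative
// reference stored in _metadata against the metadata file's new location.
// Not Flink's actual metadata handling.
public class ResolveSketch {

    static Path resolveReference(Path metadataDir, String relativeRef) {
        // Resolve the stored relative reference against the current
        // location of the _metadata file and clean up any ".." segments.
        return metadataDir.resolve(relativeRef).normalize();
    }

    public static void main(String[] args) {
        // The checkpoint directory was copied to /new/location:
        Path metadataDir = Paths.get("/new/location/chk-42");
        String storedRef = "../shared/abc123.sst"; // entry in _metadata
        System.out.println(resolveReference(metadataDir, storedRef));
        // prints: /new/location/shared/abc123.sst
    }
}
```

This resolution step only works if the relocated copy preserves the layout between the {{_metadata}} file and the referenced directories, which is why the comment above argues it is only straightforward for self-contained checkpoints.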
--
This message was sent by Atlassian Jira
(v8.20.1#820001)