[
https://issues.apache.org/jira/browse/FLINK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415426#comment-17415426
]
Feifan Wang commented on FLINK-24149:
-------------------------------------
{quote}One option that we are considering is that newly started job from a
savepoint/retained checkpoint could claim full ownership of the files from
where it started, which would allow that job to have even first checkpoint
fully incremental.
{quote}
Another option I considered: the newly started job copies (or, if the underlying
file system supports it, just hard-links) the files into its own exclusive shared
directory. Hope it helps.
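A minimal sketch of that copy-or-hard-link step, using java.nio on a local file
system for illustration (a real implementation would go through Flink's
FileSystem abstraction; the class and method names here are made up):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public final class ClaimStateFiles {

    /**
     * Copies one state file into the new job's exclusive shared directory,
     * preferring a hard link when the underlying file system supports it.
     */
    static Path claimIntoExclusiveDir(Path sourceStateFile, Path exclusiveSharedDir)
            throws IOException {
        Files.createDirectories(exclusiveSharedDir);
        Path target = exclusiveSharedDir.resolve(sourceStateFile.getFileName());
        try {
            // Hard link: no data is duplicated, both paths share the same inode.
            return Files.createLink(target, sourceStateFile);
        } catch (UnsupportedOperationException | IOException e) {
            // Fall back to a real copy, e.g. on file systems without link support.
            return Files.copy(sourceStateFile, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
{code}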
> Make checkpoint self-contained and relocatable
> ----------------------------------------------
>
> Key: FLINK-24149
> URL: https://issues.apache.org/jira/browse/FLINK-24149
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Checkpointing
> Reporter: Feifan Wang
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2021-09-08-17-06-31-560.png,
> image-2021-09-08-17-10-28-240.png, image-2021-09-08-17-55-46-898.png,
> image-2021-09-08-18-01-03-176.png, image-2021-09-14-14-22-31-537.png
>
>
> h1. Background
> We have many jobs with large state in our production environment. Based on our
> operational experience with these jobs and the analysis of some specific
> problems, we believe that RocksDBStateBackend's incremental checkpoints have
> several advantages over savepoints:
> # A savepoint takes much longer than an incremental checkpoint for jobs with
> large state. The figure below shows a job in our production environment: it
> takes nearly 7 minutes to complete a savepoint, while a checkpoint takes only a
> few seconds. (Checkpoints taking longer right after a savepoint is a separate
> problem, described in -FLINK-23949-.)
> !image-2021-09-08-17-55-46-898.png|width=723,height=161!
> # A savepoint causes excessive CPU usage. The figure below shows the CPU usage
> of the same job as in the figure above:
> !image-2021-09-08-18-01-03-176.png|width=516,height=148!
> # A savepoint may cause excessive native memory usage and eventually push the
> TaskManager process memory beyond its limit. (We did not investigate the cause
> further or try to reproduce the problem on other large-state jobs; we only
> increased the overhead memory, so this point may not be conclusive.)
> For the above reasons, we would like to use retained incremental checkpoints to
> completely replace savepoints for jobs with large state.
> h1. Problems
> * *Problem 1: retained incremental checkpoints are difficult to clean up once
> they have been used for recovery*
> This is because a job recovered from a retained incremental checkpoint may keep
> referencing files in that checkpoint's shared directory in its subsequent
> checkpoints, even though they belong to a different job instance. In the worst
> case, retained checkpoints reference one another, forming a very long reference
> chain. This makes it difficult for users to manage retained checkpoints. In
> fact, we have already suffered failures caused by incorrectly deleting retained
> checkpoints.
> Although we could use the file handles in the checkpoint metadata to figure out
> which files can safely be deleted, I think it is inappropriate to ask users to
> do this.
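> For reference, a rough sketch of what such an inspection has to look like
> today. It relies on Flink's internal checkpoint classes (names are from around
> Flink 1.13/1.14 and are not a stable API), which is exactly the kind of code we
> should not expect users to write:
> {code:java}
> import org.apache.flink.runtime.checkpoint.Checkpoints;
> import org.apache.flink.runtime.checkpoint.OperatorState;
> import org.apache.flink.runtime.checkpoint.OperatorSubtaskState;
> import org.apache.flink.runtime.checkpoint.metadata.CheckpointMetadata;
> import org.apache.flink.runtime.state.IncrementalRemoteKeyedStateHandle;
> import org.apache.flink.runtime.state.KeyedStateHandle;
>
> import java.io.DataInputStream;
> import java.io.File;
> import java.io.FileInputStream;
> import java.io.IOException;
>
> public class ListReferencedSharedFiles {
>     public static void main(String[] args) throws IOException {
>         File checkpointDir = new File(args[0]); // e.g. .../<jobId>/chk-42
>         try (DataInputStream in = new DataInputStream(
>                 new FileInputStream(new File(checkpointDir, "_metadata")))) {
>             CheckpointMetadata metadata = Checkpoints.loadCheckpointMetadata(
>                     in, ListReferencedSharedFiles.class.getClassLoader(),
>                     checkpointDir.getAbsolutePath());
>             for (OperatorState operatorState : metadata.getOperatorStates()) {
>                 for (OperatorSubtaskState subtask : operatorState.getStates()) {
>                     for (KeyedStateHandle handle : subtask.getManagedKeyedState()) {
>                         if (handle instanceof IncrementalRemoteKeyedStateHandle) {
>                             // Shared files may live in another (retained)
>                             // checkpoint's shared directory.
>                             ((IncrementalRemoteKeyedStateHandle) handle)
>                                     .getSharedState()
>                                     .values()
>                                     .forEach(System.out::println);
>                         }
>                     }
>                 }
>             }
>         }
>     }
> }
> {code}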
> * *Problem 2: checkpoints are not relocatable*
> Even if we can figure out all the files referenced by a checkpoint, moving
> those files will invalidate the checkpoint as well, because the metadata file
> references absolute file paths.
> Since savepoints are already self-contained and relocatable (FLINK-5763), why
> not just use a savepoint to migrate jobs to another place? Besides the
> savepoint performance problems described in the background section, a very
> important reason is that the need to migrate may be caused by a failure of the
> original cluster, in which case there is no opportunity to trigger a savepoint.
> h1. Proposal
> * *A job's checkpoint directory (user-defined-checkpoint-dir/<jobId>) contains
> all of its state files (self-contained)*
> As far as I know, currently only the subsequent checkpoints of jobs restored
> from a retained checkpoint violate this constraint. One possible solution is to
> re-upload all shared files at the first incremental checkpoint after the job
> starts, but we need to discuss how to distinguish a new job instance from a
> restart.
> * *Use relative file paths in checkpoint metadata (relocatable)*
> Change all file references in the checkpoint metadata to paths relative to the
> _metadata file, so that we can copy user-defined-checkpoint-dir/<jobId> to any
> other place (see the sketch below).
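> A minimal sketch of the relativize-on-write / resolve-on-read idea, using plain
> java.nio paths for illustration (Flink itself uses org.apache.flink.core.fs.Path,
> and the directory names below are made up):
> {code:java}
> import java.nio.file.Path;
> import java.nio.file.Paths;
>
> public class RelativeMetadataPaths {
>     public static void main(String[] args) {
>         // When writing _metadata: store state file paths relative to the
>         // directory that contains the _metadata file.
>         Path originalMetadataDir = Paths.get("/checkpoints/jobA/chk-42");
>         Path stateFile = Paths.get("/checkpoints/jobA/shared/000123.sst");
>         String stored = originalMetadataDir.relativize(stateFile).toString();
>         System.out.println(stored); // ../shared/000123.sst
>
>         // When reading _metadata: resolve against wherever _metadata lives now,
>         // so the whole <jobId> directory can be copied or moved anywhere.
>         Path newMetadataDir = Paths.get("/archive/jobA-copy/chk-42");
>         Path resolved = newMetadataDir.resolve(stored).normalize();
>         System.out.println(resolved); // /archive/jobA-copy/shared/000123.sst
>     }
> }
> {code}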
>
> BTW, this issue is very similar to FLINK-5763, which can be read as background
> material for this one.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)