[ https://issues.apache.org/jira/browse/FLINK-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986760#comment-16986760 ]

Nico Kruber commented on FLINK-15012:
-------------------------------------

Well, we do have a lot of temp directories that will be deleted with 
{{stop-cluster.sh}}, e.g. blobStorage or flink-io.

However, the checkpoint directory may be special because it is shared between 
the JobManager and the TaskManager processes. Even if the JobManager cleans 
it up, some TaskManager could still be writing to it if a checkpoint is being 
created concurrently. I have not tried it, but I am a bit concerned that this 
may also happen in a real cluster setup, for example on Kubernetes, where you 
may kill the Flink cluster (along with all running jobs) through K8s. Since we 
claim that the checkpoint lifecycle is managed by Flink, it should actually 
always do the cleanup*
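
To make the ordering concern concrete, here is a rough sketch of a shutdown 
sequence that would avoid the race: stop triggering new checkpoints, wait for 
in-flight ones, and only then delete the job's checkpoint directory. This is 
purely illustrative; the {{CheckpointTrigger}} interface and its methods below 
are made-up placeholders, not Flink's actual CheckpointCoordinator API.

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.concurrent.TimeUnit;
import java.util.stream.Stream;

// Hypothetical coordinator handle; not Flink's CheckpointCoordinator API.
interface CheckpointTrigger {
    void stopTriggering();                                // no new checkpoints
    void awaitPendingCheckpoints(long timeout, TimeUnit unit)
            throws InterruptedException;                  // wait for in-flight ones
}

final class CheckpointDirCleanup {

    // Shutdown order that avoids TaskManagers writing into a directory
    // the JobManager is concurrently deleting.
    static void shutDownAndClean(CheckpointTrigger trigger, Path checkpointDir)
            throws InterruptedException, IOException {
        trigger.stopTriggering();
        // Give in-flight checkpoints a chance to finish or abort cleanly.
        trigger.awaitPendingCheckpoints(30, TimeUnit.SECONDS);
        // Only then remove the job's checkpoint directory recursively.
        if (Files.exists(checkpointDir)) {
            try (Stream<Path> paths = Files.walk(checkpointDir)) {
                paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                    try {
                        Files.delete(p);
                    } catch (IOException e) {
                        // Best effort: log and keep deleting the rest.
                        System.err.println("Could not delete " + p + ": " + e);
                    }
                });
            }
        }
    }

    private CheckpointDirCleanup() {}
}
{code}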

 

Looking at the code you linked for ZooKeeperCompletedCheckpointStore, as well 
as how StandaloneCompletedCheckpointStore implements its {{shutdown()}} method, 
I am also wondering why they only clean up completed checkpoints. Shouldn't 
they also clean up in-progress checkpoints (if possible)?
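
Purely as an illustration of that idea (this is not Flink's 
CompletedCheckpointStore interface; the {{Checkpoint}} and {{discard()}} names 
below are made up), a store shutdown that disposes in-progress checkpoints in 
addition to completed ones could look roughly like this:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical checkpoint handle; not Flink's API.
interface Checkpoint {
    void discard() throws Exception;  // delete this checkpoint's files on disk
}

// Illustrative store that tracks completed *and* in-progress checkpoints
// and disposes both when it is shut down.
class IllustrativeCheckpointStore {
    private final List<Checkpoint> completed = new ArrayList<>();
    private final List<Checkpoint> inProgress = new ArrayList<>();

    void addCompleted(Checkpoint checkpoint) { completed.add(checkpoint); }
    void addInProgress(Checkpoint checkpoint) { inProgress.add(checkpoint); }

    void shutdown() {
        // In-progress checkpoints can no longer complete after shutdown,
        // so discard them as well instead of leaving their directories behind.
        discardAll(inProgress);
        discardAll(completed);
        inProgress.clear();
        completed.clear();
    }

    private static void discardAll(List<Checkpoint> checkpoints) {
        for (Checkpoint checkpoint : checkpoints) {
            try {
                checkpoint.discard();
            } catch (Exception e) {
                // Best effort: one failure should not stop the remaining cleanup.
                System.err.println("Failed to discard checkpoint: " + e);
            }
        }
    }
}
{code}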

 

* There may be some strings attached, but then they would need to be documented 
so that DevOps can account for them and, if necessary, do a manual cleanup (if 
the checkpoint path lets them identify what to delete).

> Checkpoint directory not cleaned up
> -----------------------------------
>
>                 Key: FLINK-15012
>                 URL: https://issues.apache.org/jira/browse/FLINK-15012
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.1
>            Reporter: Nico Kruber
>            Priority: Major
>
> I started a Flink cluster with 2 TMs using {{start-cluster.sh}} and the 
> following config (in addition to the default {{flink-conf.yaml}})
> {code:java}
> state.checkpoints.dir: file:///path/to/checkpoints/
> state.backend: rocksdb {code}
> After submitting a job with checkpoints enabled (every 5s), checkpoints show 
> up, e.g.
> {code:java}
> bb969f842bbc0ecc3b41b7fbe23b047b/
> ├── chk-2
> │   ├── 238969e1-6949-4b12-98e7-1411c186527c
> │   ├── 2702b226-9cfc-4327-979d-e5508ab2e3d5
> │   ├── 4c51cb24-6f71-4d20-9d4c-65ed6e826949
> │   ├── e706d574-c5b2-467a-8640-1885ca252e80
> │   └── _metadata
> ├── shared
> └── taskowned {code}
> If I shut down the cluster via {{stop-cluster.sh}}, these files remain on 
> disk and are not cleaned up.
> In contrast, if I cancel the job, at least {{chk-2}} is deleted, but the 
> (empty) directories are still left behind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
