[jira] [Commented] (FLINK-15012) Checkpoint directory not cleaned up

Stephan Ewen (Jira) Mon, 25 May 2020 11:06:30 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116169#comment-17116169
 ]


Stephan Ewen commented on FLINK-15012:
--------------------------------------

I think there is a big difference between the working/temp directories and the 
checkpoint directories.

The working/temp directories can be cleaned up after processes shut down, 
because no data in them will ever be needed.
The checkpoint directories may contain retained checkpoints or savepoints that 
are still relevant. I think we should not ever try to delete these with things 
like "shutdown hooks".

I understand that job cancellation should remove the job's empty parent 
checkpoint directories. That makes sense. And [~yunta] proposed an issue to fix 
this.
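
To illustrate the "remove only empty directories" idea, here is a rough 
sketch. It is not Flink's actual code: the class and method names are 
hypothetical, and the {{shared}} / {{taskowned}} names are taken from the 
directory listing in the issue description. It relies on 
{{java.nio.file.Files.delete}} refusing to remove a non-empty directory on 
the usual local file systems, so anything that still holds checkpoint or 
savepoint files is left alone.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical helper, for illustration only (not part of Flink).
final class EmptyCheckpointDirCleanup {

    /** Deletes dir only if it is an empty directory; returns true if it was removed. */
    static boolean deleteIfEmpty(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false; // missing, or a regular file: never touch it
        }
        try {
            Files.delete(dir); // non-recursive; fails if the directory has entries
            return true;
        } catch (DirectoryNotEmptyException | NoSuchFileException e) {
            return false; // still holds data, or vanished concurrently
        }
    }

    /** Removes the job's checkpoint parent directory if nothing is left in it. */
    static boolean cleanUpJobDir(Path jobDir) throws IOException {
        // Assumed layout: <jobDir>/shared and <jobDir>/taskowned, as in the listing.
        deleteIfEmpty(jobDir.resolve("shared"));
        deleteIfEmpty(jobDir.resolve("taskowned"));
        return deleteIfEmpty(jobDir);
    }
}
```

A job directory that still contains something like {{chk-2/_metadata}} is 
not empty, so {{cleanUpJobDir}} returns false and leaves it in place.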

I would question whether we should try and do anything about the 
{{stop-cluster.sh}} behavior. This is forceful wiping of the cluster rather 
than proper shutdown, so left-over data is to be expected. And, in my mind, the 
caution to not accidentally delete a still-needed checkpoint is more important 
than making the "hard stop" as nice as possible (cleanup-wise).


> Checkpoint directory not cleaned up
> -----------------------------------
>
>                 Key: FLINK-15012
>                 URL: https://issues.apache.org/jira/browse/FLINK-15012
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.1
>            Reporter: Nico Kruber
>            Assignee: Yun Tang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I started a Flink cluster with 2 TMs using {{start-cluster.sh}} and the 
> following config (in addition to the default {{flink-conf.yaml}})
> {code:java}
> state.checkpoints.dir: file:///path/to/checkpoints/
> state.backend: rocksdb {code}
> After submitting a job with checkpoints enabled (every 5s), checkpoints show 
> up, e.g.
> {code:java}
> bb969f842bbc0ecc3b41b7fbe23b047b/
> ├── chk-2
> │   ├── 238969e1-6949-4b12-98e7-1411c186527c
> │   ├── 2702b226-9cfc-4327-979d-e5508ab2e3d5
> │   ├── 4c51cb24-6f71-4d20-9d4c-65ed6e826949
> │   ├── e706d574-c5b2-467a-8640-1885ca252e80
> │   └── _metadata
> ├── shared
> └── taskowned {code}
> If I shut down the cluster via {{stop-cluster.sh}}, these files will remain 
> on disk and not be cleaned up.
> In contrast, if I cancel the job, at least {{chk-2}} will be deleted, but 
> the (empty) directories are still left behind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
