[jira] [Commented] (FLINK-15012) Checkpoint directory not cleaned up

Yun Tang (Jira) Mon, 02 Dec 2019 10:05:03 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16986241#comment-16986241
 ]


Yun Tang commented on FLINK-15012:
----------------------------------

[~NicoK], I think there are two separate problems here:
 * {{stop-cluster.sh}} kills the Java processes directly, so the shutdown 
phase of the checkpoint store is never triggered. We might need to refactor 
this part, for example to cancel all jobs first.
 * The empty {{job-id}} directories remain after canceling a job because 
{{CheckpointRetentionPolicy.NEVER_RETAIN_AFTER_TERMINATION}} currently only 
takes effect when the checkpoint store [shuts down 
|https://github.com/apache/flink/blob/b0fc92b4883270faec68bde70403fed8cc8bd15a/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/ZooKeeperCompletedCheckpointStore.java#L270].
 Even then, only the individual checkpoints are affected, which means only the 
{{chk-id}} folders are removed. If we want the whole {{job-id}} directory to 
be removed, we also need to introduce {{shutdown(JobStatus)}} to 
{{CheckpointStorageCoordinatorView}}, as this is the only place that knows the 
base {{job-id}} directory.
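
To illustrate the second point, here is a rough sketch of what removing the whole {{job-id}} directory on shutdown could look like. It is not Flink code: the class {{JobCheckpointDirCleanup}}, the method {{cleanUp}} and the {{retainCheckpoints}} flag are hypothetical names, and it uses plain {{java.nio.file}} on a local path instead of Flink's {{FileSystem}} abstraction.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration only; not part of Flink. */
public class JobCheckpointDirCleanup {

    /**
     * Deletes the base job-id directory, including chk-*, shared and
     * taskowned, unless checkpoints should be retained.
     *
     * @return true if the directory was deleted, false if it was left alone
     */
    public static boolean cleanUp(Path jobDir, boolean retainCheckpoints) throws IOException {
        if (retainCheckpoints || !Files.exists(jobDir)) {
            return false;
        }
        List<Path> toDelete;
        try (Stream<Path> paths = Files.walk(jobDir)) {
            // Reverse order puts children before their parent directories.
            toDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : toDelete) {
            Files.delete(p);
        }
        return true;
    }
}
```

In a real change, the retention decision would presumably come from the {{JobStatus}} passed to the proposed {{shutdown(JobStatus)}} together with the configured retention policy, rather than from a boolean flag.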

Since this problem has existed for a long time, if we decide to resolve it, I 
think I could help change {{CheckpointStorageCoordinatorView}}, as I 
introduced this class. For the {{stop-cluster.sh}} problem, we might need 
more discussion.

> Checkpoint directory not cleaned up
> -----------------------------------
>
>                 Key: FLINK-15012
>                 URL: https://issues.apache.org/jira/browse/FLINK-15012
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.1
>            Reporter: Nico Kruber
>            Priority: Major
>
> I started a Flink cluster with 2 TMs using {{start-cluster.sh}} and the 
> following config (in addition to the default {{flink-conf.yaml}})
> {code:java}
> state.checkpoints.dir: file:///path/to/checkpoints/
> state.backend: rocksdb {code}
> After submitting a job with checkpoints enabled (every 5s), checkpoints show 
> up, e.g.
> {code:java}
> bb969f842bbc0ecc3b41b7fbe23b047b/
> ├── chk-2
> │   ├── 238969e1-6949-4b12-98e7-1411c186527c
> │   ├── 2702b226-9cfc-4327-979d-e5508ab2e3d5
> │   ├── 4c51cb24-6f71-4d20-9d4c-65ed6e826949
> │   ├── e706d574-c5b2-467a-8640-1885ca252e80
> │   └── _metadata
> ├── shared
> └── taskowned {code}
> If I shut down the cluster via {{stop-cluster.sh}}, these files will remain 
> on disk and not be cleaned up.
> In contrast, if I cancel the job, at least {{chk-2}} will be deleted, but 
> the (empty) directories still remain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
