[jira] [Commented] (FLINK-11789) Checkpoint directories are not cleaned up after job termination

Yun Tang (JIRA) Fri, 01 Mar 2019 23:56:56 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-11789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782314#comment-16782314
 ]


Yun Tang commented on FLINK-11789:
----------------------------------

[~shengjk1] When a job fails, we should also take the 
{{ExternalizedCheckpointCleanup}} mode into account: if the mode is 
{{DELETE_ON_CANCELLATION}}, the checkpoint is retained on failure, so we cannot 
remove the {{checkpoints/JOB_ID}} folder in that case either.
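
To make the failure case concrete, here is a minimal sketch of the decision. The class, the enums and the {{mayDeleteJobDir}} method are hypothetical names for illustration, not Flink code, and the sketch only covers the two terminal states discussed here; it assumes that externalized checkpoints are kept whenever the job ends as FAILED.

```java
/** Hypothetical model of the cleanup decision; not part of Flink. */
final class JobDirCleanupSketch {

    /** Mirrors the two cleanup modes named in this discussion. */
    enum CleanupMode { DELETE_ON_CANCELLATION, RETAIN_ON_CANCELLATION }

    /** Only the two terminal states discussed in this comment. */
    enum TerminalStatus { CANCELED, FAILED }

    /** Returns true if checkpoints/JOB_ID may be removed as a whole. */
    static boolean mayDeleteJobDir(TerminalStatus status, CleanupMode mode) {
        switch (status) {
            case CANCELED:
                // On cancellation the configured mode decides.
                return mode == CleanupMode.DELETE_ON_CANCELLATION;
            case FAILED:
                // Assumption: the externalized checkpoint is kept on failure
                // in both modes, so the folder must stay.
                return false;
            default:
                throw new IllegalArgumentException("Unhandled status: " + status);
        }
    }
}
```

With this model, a failed job under {{DELETE_ON_CANCELLATION}} keeps its folder, while a cancelled one does not.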

What's more [~till.rohrmann], I am wondering whether we should always create 
the {{JOB_ID}} as part of the path. As [Stephan Ewen's 
comment|https://issues.apache.org/jira/browse/FLINK-9043?focusedCommentId=16409254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16409254]
 once suggested, we might introduce an option to have jobs not create the 
UUID subdirectory. As far as I know, recovering from a previous checkpoint is 
a common and annoying problem for many companies. They have to either change 
Flink's source code to find the latest checkpoint automatically or maintain an 
external system to remember the previously submitted job ids.

Actually, most companies and people would use the job name instead of the 
job id to record operations. At Alibaba, we introduced an option to not create 
the {{JOB_ID}} subdirectory. By ensuring that jobs with the same name never have 
multiple submitted applications, we can restore automatically by finding the 
latest checkpoint. (Unless we need a savepoint's advanced features, we always 
use checkpoints instead of savepoints, because taking a savepoint is really 
slow when we just want to restart the job as soon as possible, not to mention 
when the state is really large.)
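
As a rough illustration of "finding the latest checkpoint", the following is a minimal sketch and not the Alibaba implementation. It assumes a local file system and a layout in which the given directory directly contains {{chk-N}} subdirectories, each with a {{_metadata}} file once the checkpoint is complete; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of Flink. */
final class LatestCheckpointFinder {

    private static final String PREFIX = "chk-";

    /** Returns the chk-N directory with the highest N that has a _metadata file. */
    static Optional<Path> findLatest(Path checkpointRoot) throws IOException {
        if (!Files.isDirectory(checkpointRoot)) {
            return Optional.empty();
        }
        try (Stream<Path> children = Files.list(checkpointRoot)) {
            return children
                    .filter(Files::isDirectory)
                    // Skip entries such as shared/ and taskowned/.
                    .filter(p -> checkpointId(p) >= 0)
                    // Assumption: a checkpoint without _metadata is incomplete.
                    .filter(p -> Files.isRegularFile(p.resolve("_metadata")))
                    .max(Comparator.comparingLong(LatestCheckpointFinder::checkpointId));
        }
    }

    /** Parses N from a "chk-N" directory name; returns -1 for any other name. */
    static long checkpointId(Path dir) {
        String name = dir.getFileName().toString();
        if (!name.startsWith(PREFIX)) {
            return -1L;
        }
        try {
            return Long.parseLong(name.substring(PREFIX.length()));
        } catch (NumberFormatException e) {
            return -1L;
        }
    }
}
```

The ids are compared numerically rather than as strings, so {{chk-12}} is preferred over {{chk-3}}. A real deployment would go through the file system abstraction of the configured checkpoint storage instead of {{java.nio.file}}.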

I think it would be worthwhile to discuss the current checkpoint directory 
layout now that we have noticed the problem of {{JOB_ID}} sub-directories not 
being cleaned up.

> Checkpoint directories are not cleaned up after job termination
> ---------------------------------------------------------------
>
>                 Key: FLINK-11789
>                 URL: https://issues.apache.org/jira/browse/FLINK-11789
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Priority: Major
>
> Flink currently does not clean up all checkpoint directories when a job 
> reaches a globally terminal state. Having configured the checkpoint directory 
> {{checkpoints}}, I observe that after cancelling the job {{JOB_ID}} there are 
> still
> {code}
> checkpoints/JOB_ID/shared
> checkpoints/JOB_ID/taskowned
> {code}
> I think it would be good if we would delete {{checkpoints/JOB_ID}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
