[
https://issues.apache.org/jira/browse/FLINK-11789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782314#comment-16782314
]
Yun Tang commented on FLINK-11789:
----------------------------------
[~shengjk1] When a job fails, we should also take
{{ExternalizedCheckpointCleanup}} into account: if the mode is
{{DELETE_ON_CANCELLATION}}, the checkpoint is retained on failure, so we cannot
remove the {{checkpoints/JOB_ID}} folder either.
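To make the rule above concrete, here is a minimal sketch of the decision. The class, enum, and method names are hypothetical and are not Flink's actual classes. Treating {{FINISHED}} as always safe to delete is an assumption for illustration.

```java
/** Illustrative only: hypothetical names, not Flink's real API. */
final class CheckpointDirCleanupSketch {

    /** The two externalized-checkpoint cleanup modes discussed above. */
    enum CleanupMode { DELETE_ON_CANCELLATION, RETAIN_ON_CANCELLATION }

    /** Globally terminal job states. */
    enum TerminalState { FINISHED, CANCELED, FAILED }

    /**
     * Returns true if the whole checkpoints/JOB_ID directory may be removed,
     * i.e. no retained checkpoint still lives inside it.
     */
    static boolean mayDeleteJobDirectory(CleanupMode mode, TerminalState state) {
        switch (state) {
            case FINISHED:
                // Assumption: nothing is retained once the job finished normally.
                return true;
            case CANCELED:
                // Only DELETE_ON_CANCELLATION drops the checkpoint on cancel.
                return mode == CleanupMode.DELETE_ON_CANCELLATION;
            case FAILED:
                // Both modes keep the checkpoint on failure, so keep the folder.
                return false;
            default:
                throw new IllegalArgumentException("Unknown state: " + state);
        }
    }
}
```

With this rule, a failed job never has its {{checkpoints/JOB_ID}} folder removed, which is the case raised above.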
What's more, [~till.rohrmann], I am wondering whether we should always create
the {{JOB_ID}} as part of the path. As [Stephan Ewen's
comment|https://issues.apache.org/jira/browse/FLINK-9043?focusedCommentId=16409254&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16409254]
once said, we might introduce an option to have jobs not create the
UUID subdirectory. As far as I know, recovering from the previous checkpoint is
a common and annoying problem for many companies. They have to either change
Flink's source code to find the latest checkpoint automatically or maintain an
external system to remember the previously submitted job ids.
Actually, most companies and people use the job name instead of the
job id to track their operations. At Alibaba, we introduced an option to not
create the {{JOB_ID}} subdirectory. By ensuring that jobs with the same name
never have multiple submitted applications, we can restore automatically by
finding the latest checkpoint. (Unless we need savepoint's advanced features, we
always use checkpoints instead of savepoints, as a savepoint is really slow when
we just want to restart the job as soon as possible, not to mention when the
state is really large.)
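As a rough illustration of "finding the latest checkpoint", the sketch below scans a per-job-name directory for {{chk-<n>}} subdirectories and picks the one with the highest number that contains a {{_metadata}} file. The class name, the method name, and the use of a local file system through {{java.nio.file}} are assumptions for illustration. This is not the Alibaba implementation, and a real deployment would more likely go through a distributed file system.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: hypothetical helper, not part of Flink. */
final class LatestCheckpointFinder {

    // Assumed layout: <baseDir>/chk-<n>/_metadata
    private static final Pattern CHK_DIR = Pattern.compile("chk-(\\d{1,18})");

    /** Returns the completed checkpoint directory with the highest id, if any. */
    static Optional<Path> findLatest(Path baseDir) throws IOException {
        Path latest = null;
        long latestId = -1L;
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(baseDir)) {
            for (Path entry : entries) {
                Matcher m = CHK_DIR.matcher(entry.getFileName().toString());
                if (!Files.isDirectory(entry) || !m.matches()) {
                    continue; // not a chk-<n> directory
                }
                if (!Files.isRegularFile(entry.resolve("_metadata"))) {
                    continue; // no metadata file, treat as incomplete
                }
                long id = Long.parseLong(m.group(1));
                if (id > latestId) {
                    latestId = id;
                    latest = entry;
                }
            }
        }
        return Optional.ofNullable(latest);
    }
}
```

The returned path could then be handed to the restore option of the submission command. This only works if at most one job writes into the directory, which is why same-named jobs must not run concurrently.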
I think it would be worth discussing the current checkpoint directory layout
now that we have noticed the problem of {{JOB_ID}} sub-directories not being
cleaned up.
> Checkpoint directories are not cleaned up after job termination
> ---------------------------------------------------------------
>
> Key: FLINK-11789
> URL: https://issues.apache.org/jira/browse/FLINK-11789
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing
> Affects Versions: 1.9.0
> Reporter: Till Rohrmann
> Priority: Major
>
> Flink currently does not clean up all checkpoint directories when a job
> reaches a globally terminal state. Having configured the checkpoint directory
> {{checkpoints}}, I observe that after cancelling the job {{JOB_ID}} there are
> still
> {code}
> checkpoints/JOB_ID/shared
> checkpoints/JOB_ID/taskowned
> {code}
> I think it would be good if we would delete {{checkpoints/JOB_ID}}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)