[
https://issues.apache.org/jira/browse/FLINK-26606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matthias Pohl updated FLINK-26606:
----------------------------------
Description:
We introduced a repeatable per-job cleanup that runs after the job has
reached a globally-terminated state. It also tries to clean up the
{{CompletedCheckpointStore}}. But we missed one code path in which the
{{CheckpointsCleaner}} tries to discard {{CompletedCheckpoints}}. If such a
discard fails, the {{CompletedCheckpointStore}} no longer holds any
references to these {{CompletedCheckpoints}}, so the shutdown at the end is
not able to clean them up.
We should not remove the {{CompletedCheckpoints}} from the
{{CompletedCheckpointStore}} if the deletion failed. This would enable us to
retry deleting these artifacts at the end of the job and consider them in the
retryable cleanup as well.
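The proposed behaviour can be illustrated with a minimal sketch. It is a simplified, hypothetical model and not Flink's actual implementation: the {{Store}} class, its {{add}}, {{ids}} and {{shutdown}} methods, and the {{discarder}} callback are names assumed for illustration only. It also assumes that the shutdown discards every checkpoint the store still references. The point it shows is that a reference is dropped only after a successful discard, so a failed discard stays visible to a later retry.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical, simplified model; these are NOT Flink's actual classes.
final class RetainOnFailedDiscardSketch {

    /** Stand-in for a checkpoint store; checkpoints are modelled by their IDs. */
    static final class Store {
        private final Deque<Long> checkpoints = new ArrayDeque<>();
        private final int maxRetained;
        // Assumed callback: returns true if the artifacts were deleted.
        private final Predicate<Long> discarder;

        Store(int maxRetained, Predicate<Long> discarder) {
            this.maxRetained = maxRetained;
            this.discarder = discarder;
        }

        /** Adds a checkpoint and tries to discard the oldest surplus ones. */
        void add(long id) {
            checkpoints.addLast(id);
            int surplus = checkpoints.size() - maxRetained;
            Iterator<Long> it = checkpoints.iterator();
            for (int i = 0; i < surplus && it.hasNext(); i++) {
                // Drop the reference only if the discard succeeded.
                if (discarder.test(it.next())) {
                    it.remove();
                }
            }
        }

        /** Tries to discard everything that is left; true means cleanup is done. */
        boolean shutdown() {
            checkpoints.removeIf(discarder);
            return checkpoints.isEmpty();
        }

        /** IDs the store still references, oldest first. */
        List<Long> ids() {
            return new ArrayList<>(checkpoints);
        }
    }

    public static void main(String[] args) {
        Set<Long> failing = new HashSet<>(List.of(1L)); // discarding 1 fails
        Store store = new Store(1, id -> !failing.contains(id));
        store.add(1L);
        store.add(2L);
        System.out.println(store.ids());      // [1, 2]: 1 is kept for a retry
        System.out.println(store.shutdown()); // false: 1 still cannot be deleted
        failing.clear();                      // the failure cause is resolved
        System.out.println(store.shutdown()); // true: the retry succeeded
    }
}
```

Because {{shutdown}} reports whether anything is left, a caller such as the retryable cleanup could invoke it again until it returns true.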
The documentation was updated to cover this issue. Fixing this issue should
also include removing the corresponding paragraph from the documentation (see
[related flink-docs PR|https://github.com/apache/flink/pull/19058]).
was:
We introduced a repeatable per-job cleanup that runs after the job has
reached a globally-terminated state. It also tries to clean up the
{{CompletedCheckpointStore}}. But we missed one code path in which the
{{CheckpointsCleaner}} tries to discard {{CompletedCheckpoints}}. If such a
discard fails, the {{CompletedCheckpointStore}} no longer holds any
references to these {{CompletedCheckpoints}}, so the shutdown at the end is
not able to clean them up.
We should not remove the {{CompletedCheckpoints}} from the
{{CompletedCheckpointStore}} if the deletion failed. This would enable us to
retry deleting these artifacts at the end of the job and consider them in the
retryable cleanup as well.
> CompletedCheckpoints that failed to be discarded are not stored in the
> CompletedCheckpointStore
> -----------------------------------------------------------------------------------------------
>
> Key: FLINK-26606
> URL: https://issues.apache.org/jira/browse/FLINK-26606
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.15.0
> Reporter: Matthias Pohl
> Priority: Major
>
> We introduced a repeatable per-job cleanup that runs after the job has
> reached a globally-terminated state. It also tries to clean up the
> {{CompletedCheckpointStore}}. But we missed one code path in which the
> {{CheckpointsCleaner}} tries to discard {{CompletedCheckpoints}}. If such a
> discard fails, the {{CompletedCheckpointStore}} no longer holds any
> references to these {{CompletedCheckpoints}}, so the shutdown at the end is
> not able to clean them up.
> We should not remove the {{CompletedCheckpoints}} from the
> {{CompletedCheckpointStore}} if the deletion failed. This would enable us to
> retry deleting these artifacts at the end of the job and consider them in the
> retryable cleanup as well.
> The documentation was updated to cover this issue. Fixing this issue should
> also include removing the corresponding paragraph from the documentation (see
> [related flink-docs PR|https://github.com/apache/flink/pull/19058]).
--
This message was sent by Atlassian Jira
(v8.20.1#820001)