[ https://issues.apache.org/jira/browse/FLINK-30863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686903#comment-17686903 ]
Yanfei Lei commented on FLINK-30863:
------------------------------------
[~roman] Thanks for your reply.
# Yes, this issue might make local recovery fail after a checkpoint is aborted,
and the job would then recover from the remote DFS. This issue doesn't cause
data loss.
# In the case of many consecutive aborted checkpoints, none of the aborted
local state will be deleted until the next completed checkpoint. Right, this is
a degradation in some cases. As [~xiarui]
[suggested|https://github.com/apache/flink/pull/21822#issuecomment-1418605498]
in the PR, I'm going to use reference counting to decide when to delete a file.
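
A minimal sketch of the reference-counting idea, only to illustrate the
approach. The names (LocalFileRefCounter, register, release) are hypothetical
and are not taken from the PR or from Flink's API. Each checkpoint registers
the local files it references, and a file becomes deletable only when the last
checkpoint referencing it is released (aborted or subsumed).

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not Flink's actual classes or API.
final class LocalFileRefCounter {
    // file name -> number of checkpoints that still reference it
    private final Map<String, Integer> refCounts = new HashMap<>();
    // checkpoint id -> distinct files referenced by that checkpoint
    private final Map<Long, List<String>> filesByCheckpoint = new HashMap<>();

    /** Records that the given checkpoint references the given local files. */
    synchronized void register(long checkpointId, Collection<String> files) {
        if (filesByCheckpoint.containsKey(checkpointId)) {
            throw new IllegalStateException("Already registered: " + checkpointId);
        }
        // De-duplicate so each file is counted once per checkpoint.
        List<String> distinct = new ArrayList<>(new LinkedHashSet<>(files));
        filesByCheckpoint.put(checkpointId, distinct);
        for (String file : distinct) {
            refCounts.merge(file, 1, Integer::sum);
        }
    }

    /**
     * Drops the references held by an aborted or subsumed checkpoint and
     * returns the files that no remaining checkpoint references. The caller
     * is responsible for actually deleting them.
     */
    synchronized List<String> release(long checkpointId) {
        List<String> deletable = new ArrayList<>();
        List<String> files = filesByCheckpoint.remove(checkpointId);
        if (files == null) {
            return deletable; // unknown or already released checkpoint
        }
        for (String file : files) {
            int remaining = refCounts.merge(file, -1, Integer::sum);
            if (remaining <= 0) {
                refCounts.remove(file);
                deletable.add(file);
            }
        }
        return deletable;
    }
}
```

With this sketch, if checkpoint 1 references "changelog-a" and an aborted
checkpoint 2 references "changelog-a" and "changelog-b", releasing checkpoint 2
returns only "changelog-b", so the file still needed for local recovery from
checkpoint 1 is kept.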
> Do not delete the local changelog file of aborted checkpoint
> ------------------------------------------------------------
>
> Key: FLINK-30863
> URL: https://issues.apache.org/jira/browse/FLINK-30863
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / State Backends
> Affects Versions: 1.17.0
> Reporter: Yanfei Lei
> Assignee: Yanfei Lei
> Priority: Major
> Labels: pull-request-available
>
> Do not delete the local changelog files of an aborted checkpoint, because
> the checkpoint may reference files of the previous checkpoint, which would
> be used by local recovery. The local files of the aborted checkpoint would
> be deleted when the next checkpoint completes, or when the entire allocation
> folder is deleted on exit of the TM process.