[ 
https://issues.apache.org/jira/browse/FLINK-30863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686903#comment-17686903
 ] 

Yanfei Lei commented on FLINK-30863:
------------------------------------

[~roman] Thanks for your reply.
 # Yes, this issue might make local recovery fail after a checkpoint is 
aborted, and the job would then recover from the remote DFS. This issue 
doesn't cause data loss.
 # In the case of many subsequent aborted checkpoints, none of the aborted 
local state will be deleted until the next completed checkpoint. Right, this 
is a degradation in some cases. As [~xiarui] 
[suggested|https://github.com/apache/flink/pull/21822#issuecomment-1418605498] 
in the PR, I'm going to use reference counting to decide when to delete a file.
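
For illustration, the reference-counting idea could look roughly like the 
sketch below. This is only a minimal sketch under assumptions, not the code 
in the PR: the class name LocalFileRefCounter and the methods register and 
release are hypothetical, and local files are represented by plain strings. 
Each checkpoint registers the local files it references; a file becomes 
deletable only when the last checkpoint referencing it is released.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of reference counting for local files; not Flink's actual API. */
class LocalFileRefCounter {
    // File name -> number of registered references to that file.
    private final Map<String, Integer> refCounts = new HashMap<>();
    // Checkpoint id -> the files that checkpoint references.
    private final Map<Long, List<String>> filesByCheckpoint = new HashMap<>();

    /** Records that the given checkpoint references the given local files. */
    void register(long checkpointId, Collection<String> files) {
        if (filesByCheckpoint.containsKey(checkpointId)) {
            throw new IllegalStateException("Checkpoint already registered: " + checkpointId);
        }
        filesByCheckpoint.put(checkpointId, new ArrayList<>(files));
        for (String file : files) {
            refCounts.merge(file, 1, Integer::sum);
        }
    }

    /**
     * Releases a checkpoint (for example an aborted or subsumed one) and
     * returns the files that no remaining checkpoint references.
     */
    List<String> release(long checkpointId) {
        List<String> deletable = new ArrayList<>();
        List<String> files = filesByCheckpoint.remove(checkpointId);
        if (files == null) {
            return deletable; // Unknown checkpoint: nothing to delete.
        }
        for (String file : files) {
            // Decrement the count; drop the entry when it reaches zero.
            Integer remaining =
                    refCounts.computeIfPresent(file, (k, v) -> v == 1 ? null : v - 1);
            if (remaining == null) {
                deletable.add(file);
            }
        }
        return deletable;
    }
}
```

With this shape, aborting a checkpoint that shares a file with the previous 
checkpoint would leave the shared file in place for local recovery, while 
files referenced only by the aborted checkpoint could be deleted right away 
instead of waiting for the next completed checkpoint.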

> Do not delete the local changelog file of aborted checkpoint
> ------------------------------------------------------------
>
>                 Key: FLINK-30863
>                 URL: https://issues.apache.org/jira/browse/FLINK-30863
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.17.0
>            Reporter: Yanfei Lei
>            Assignee: Yanfei Lei
>            Priority: Major
>              Labels: pull-request-available
>
> Do not delete the local changelog files of an aborted checkpoint, because 
> this checkpoint may contain files of the previous checkpoint, which would 
> be used by local recovery. The local files of the aborted checkpoint would 
> be deleted when the next checkpoint completes, or when the entire 
> allocation folder is deleted as the TM process exits.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
