[jira] [Updated] (FLINK-35897) Some checkpoint files and localState files can't be cleanUp when checkpoint is aborted

Jinzhong Li (Jira) Thu, 25 Jul 2024 06:26:24 -0700


     [ 
https://issues.apache.org/jira/browse/FLINK-35897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jinzhong Li updated FLINK-35897:
--------------------------------
    Description: 
h2. Problem

When a job checkpoint is canceled 
([asyncsnapshotcallable.java/#L129|#L129]), the asynchronous snapshot thread 
may still continue executing and produce a completed checkpoint 
([RocksIncrementalSnapshotStrategy.java#L324|#L324]). In this case, no 
component is responsible for cleaning up the completed checkpoint: neither the 
async snapshot thread nor SubtaskCheckpointCoordinatorImpl.
h2. How to reproduce it

We can reproduce this issue by running the [DataGenWordCount example in my 
debug 
branch|https://github.com/ljz2051/flink/commit/33c0c55098a49a0b56c9404256a560da5069f26c],
 in which I've added some debug code. 
h2. How to fix it

When the asynchronous snapshot thread completes a checkpoint, it needs to 
clean up the completed checkpoint if it finds that the checkpoint has been 
canceled.

  was:
h2. Problem

When a job checkpoint is canceled 
([asyncsnapshotcallable.java/#L129|#L129]), the asynchronous snapshot thread 
may still continue executing and produce a completed checkpoint 
([RocksIncrementalSnapshotStrategy.java#L324|https://github.com/apache/flink/blob/d4294c59e6f2ec8702f53916ea49cf23f6db8961/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L324]).
 In this case, no component is responsible for cleaning up the completed 
checkpoint: neither the async snapshot thread nor 
SubtaskCheckpointCoordinatorImpl.

 
h3. How to reproduce it

We can reproduce this issue by running the [DataGenWordCount example in my 
debug 
branch|https://github.com/ljz2051/flink/commit/33c0c55098a49a0b56c9404256a560da5069f26c],
 in which I've added some debug code.
 
h3. How to fix it

When the asynchronous snapshot thread completes a checkpoint, it needs to 
clean up the completed checkpoint if it finds that the checkpoint has been 
canceled.


> Some checkpoint files and localState files can't be cleanUp when checkpoint 
> is aborted 
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-35897
>                 URL: https://issues.apache.org/jira/browse/FLINK-35897
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / State Backends
>            Reporter: Jinzhong Li
>            Priority: Major
>
> h2. Problem
> When a job checkpoint is canceled 
> ([asyncsnapshotcallable.java/#L129|#L129]), the asynchronous snapshot thread 
> may still continue executing and produce a completed checkpoint 
> ([RocksIncrementalSnapshotStrategy.java#L324|#L324]). In this case, no 
> component is responsible for cleaning up the completed checkpoint: neither 
> the async snapshot thread nor SubtaskCheckpointCoordinatorImpl.
> h2. How to reproduce it
> We can reproduce this issue by running the [DataGenWordCount example in my 
> debug 
> branch|https://github.com/ljz2051/flink/commit/33c0c55098a49a0b56c9404256a560da5069f26c],
>  in which I've added some debug code. 
> h2. How to fix it
> When the asynchronous snapshot thread completes a checkpoint, it needs to 
> clean up the completed checkpoint if it finds that the checkpoint has been 
> canceled.
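
The proposed fix can be illustrated with a small, self-contained sketch. It is not Flink code: CancelAwareSnapshotSketch, SnapshotTask, SnapshotResult, State and the whileRunning hook are all hypothetical names, and discard() only flips a flag where real code would delete checkpoint and localState files. The sketch assumes that a single atomic state transition decides whether the cancel or the completion won the race, so that exactly one side ends up owning the cleanup.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch only; every name here is hypothetical and none is Flink API.
final class CancelAwareSnapshotSketch {

    enum State { RUNNING, COMPLETED, CANCELED }

    /** Stand-in for the files a finished snapshot leaves behind. */
    static final class SnapshotResult {
        private final AtomicBoolean discarded = new AtomicBoolean(false);

        void discard() { discarded.set(true); } // real code would delete files here

        boolean isDiscarded() { return discarded.get(); }
    }

    /** Stand-in for the asynchronous part of a snapshot. */
    static final class SnapshotTask implements Callable<SnapshotResult> {
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final Runnable whileRunning; // hook used to simulate a concurrent cancel
        private volatile SnapshotResult produced;

        SnapshotTask(Runnable whileRunning) { this.whileRunning = whileRunning; }

        /** Called by the canceling side; returns true only if the cancel won the race. */
        boolean cancel() { return state.compareAndSet(State.RUNNING, State.CANCELED); }

        SnapshotResult produced() { return produced; }

        @Override
        public SnapshotResult call() {
            SnapshotResult result = new SnapshotResult(); // the snapshot files now exist
            produced = result;
            whileRunning.run();
            if (state.compareAndSet(State.RUNNING, State.COMPLETED)) {
                return result; // handed over; the consumer owns it from here on
            }
            // Canceled while running: nobody else will ever see this result,
            // so the snapshot thread itself has to clean it up.
            result.discard();
            throw new CancellationException("snapshot canceled");
        }
    }

    public static void main(String[] args) {
        // Case 1: the cancel arrives while the snapshot is still running.
        SnapshotTask[] holder = new SnapshotTask[1];
        holder[0] = new SnapshotTask(() -> holder[0].cancel());
        try {
            holder[0].call();
        } catch (CancellationException expected) {
            System.out.println("canceled run discarded=" + holder[0].produced().isDiscarded());
        }
        // Case 2: no cancel; the result is returned untouched.
        SnapshotTask normal = new SnapshotTask(() -> {});
        System.out.println("normal run discarded=" + normal.call().isDiscarded());
    }
}
```

In this sketch a cancel() that returns false means the result has already been handed over, so the consumer of the result, rather than the snapshot thread, would be the one to discard it.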



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
