Jinzhong Li created FLINK-35897:
-----------------------------------
Summary: Some checkpoint files and localState files can't be
cleaned up when a checkpoint is aborted
Key: FLINK-35897
URL: https://issues.apache.org/jira/browse/FLINK-35897
Project: Flink
Issue Type: Bug
Components: Runtime / Checkpointing, Runtime / State Backends
Reporter: Jinzhong Li
h2. Problem
When the job checkpoint is canceled ([AsyncSnapshotCallable.java#L129|https://github.com/apache/flink/blob/d4294c59e6f2ec8702f53916ea49cf23f6db8961/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
it is still possible for the asynchronous snapshot thread to continue
executing and generate a completed checkpoint
([RocksIncrementalSnapshotStrategy.java#L324|https://github.com/apache/flink/blob/d4294c59e6f2ec8702f53916ea49cf23f6db8961/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L324]).
In this case, no component is responsible for cleaning up the completed
checkpoint: neither the async snapshot thread nor
SubtaskCheckpointCoordinatorImpl.
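
The race described above can be illustrated with a plain java.util.concurrent.FutureTask. This is only an analogy under stated assumptions, not Flink's actual code: the class name, the latches, and the "hypothetical-snapshot-handle" string are invented for the sketch. It shows that a cancel issued while the task body is running is accepted, the body still runs to completion, and the result can no longer be retrieved through the future, so whoever holds the future has nothing to discard.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelledSnapshotLeakSketch {

    /** Returns {cancelAccepted, workStillFinished, resultUnreachable}. */
    static boolean[] demo() throws Exception {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch cancelIssued = new CountDownLatch(1);
        AtomicBoolean filesWritten = new AtomicBoolean(false);

        // Stand-in for the async snapshot work (hypothetical, not Flink code).
        FutureTask<String> task = new FutureTask<>(() -> {
            started.countDown();
            cancelIssued.await();      // the cancel arrives while we are mid-snapshot
            filesWritten.set(true);    // the work completes anyway
            return "hypothetical-snapshot-handle";
        });

        Thread asyncThread = new Thread(task);
        asyncThread.start();
        started.await();

        // cancel(false) does not interrupt the thread; the task body keeps running.
        boolean cancelAccepted = task.cancel(false);
        cancelIssued.countDown();
        asyncThread.join();

        // The future is now in the cancelled state, so the result is dropped.
        boolean resultUnreachable = false;
        try {
            task.get();
        } catch (CancellationException e) {
            resultUnreachable = true;
        }
        return new boolean[] {cancelAccepted, filesWritten.get(), resultUnreachable};
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = demo();
        System.out.println("cancel accepted: " + r[0]);
        System.out.println("work still finished: " + r[1]);
        System.out.println("result unreachable via future: " + r[2]);
    }
}
```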
h3. How to reproduce it
We can reproduce this issue by running the [DataGenWordCount example in my
debug branch|https://github.com/ljz2051/flink/commit/33c0c55098a49a0b56c9404256a560da5069f26c],
in which I've added some debug code.
h3. How to fix it
When the asynchronous snapshot thread completes a checkpoint, it needs to
clean up the completed checkpoint if it finds that the checkpoint has been
canceled.
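
One possible shape for this fix is sketched below as a minimal, self-contained example. It is not the actual patch: SnapshotResult, CancellableSnapshot, and the State enum are hypothetical names introduced for illustration. The idea is that completion and cancellation race on a single compare-and-set, so exactly one side ends up owning the cleanup: if the cancel wins, the async thread discards what it produced; if completion wins, the caller receives the result and is responsible for it.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

public class SnapshotCleanupSketch {

    enum State { RUNNING, COMPLETED, CANCELLED }

    /** Hypothetical stand-in for a snapshot result that owns checkpoint files. */
    static final class SnapshotResult {
        private boolean discarded;
        synchronized void discard() { discarded = true; }
        synchronized boolean isDiscarded() { return discarded; }
    }

    /** Hypothetical cancellable snapshot; not Flink's AsyncSnapshotCallable. */
    static final class CancellableSnapshot {
        private final AtomicReference<State> state = new AtomicReference<>(State.RUNNING);
        private final Supplier<SnapshotResult> work;

        CancellableSnapshot(Supplier<SnapshotResult> work) {
            this.work = work;
        }

        /** Runs on the async thread. Returns null if the snapshot was cancelled. */
        SnapshotResult call() {
            SnapshotResult result = work.get();
            if (state.compareAndSet(State.RUNNING, State.COMPLETED)) {
                return result;   // completion won: the caller now owns the result
            }
            result.discard();    // cancel won: nobody else can see the result, clean up here
            return null;
        }

        /** Returns true if the cancel took effect before completion. */
        boolean cancel() {
            return state.compareAndSet(State.RUNNING, State.CANCELLED);
        }
    }

    /** Simulates a cancel that lands while the snapshot work is in flight. */
    static SnapshotResult runWithCancelDuringWork() {
        SnapshotResult files = new SnapshotResult();
        CancellableSnapshot[] holder = new CancellableSnapshot[1];
        holder[0] = new CancellableSnapshot(() -> {
            holder[0].cancel();
            return files;
        });
        holder[0].call();
        return files;
    }

    /** Simulates an uncancelled snapshot; the caller receives the result. */
    static SnapshotResult runWithoutCancel() {
        return new CancellableSnapshot(SnapshotResult::new).call();
    }

    public static void main(String[] args) {
        System.out.println("discarded after cancel: " + runWithCancelDuringWork().isDiscarded());
        System.out.println("discarded without cancel: " + runWithoutCancel().isDiscarded());
    }
}
```

A cancel() that returns false in this sketch means the snapshot already completed, in which case the existing owner of the result would remain responsible for discarding it.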