Zhanghao Chen created FLINK-30513:
-------------------------------------
Summary: HA storage dir leaks on cluster termination
Key: FLINK-30513
URL: https://issues.apache.org/jira/browse/FLINK-30513
Project: Flink
Issue Type: Improvement
Components: Runtime / Coordination
Affects Versions: 1.16.0, 1.15.0
Reporter: Zhanghao Chen
Attachments: image-2022-12-27-21-32-17-510.png
*Problem*
We found that the HA storage dir leaks on cluster termination for Flink jobs
with HA enabled. The following picture shows the HA storage dir (here on HDFS)
of the cluster czh-flink-test-offline (running in application mode) after
cancelling the job with flink-cancel. We are left with an empty dir, and an
accumulation of such empty dirs will hurt the stability of the HDFS NameNode.
!image-2022-12-27-21-32-17-510.png|width=582,height=158!
Furthermore, if the user chooses to retain checkpoints on job termination, the
completedCheckpoints files are leaked as well. Note that these files are no
longer needed at that point, because retained CPs are recovered directly from
the CP data dir.
*Root Cause*
AbstractHaServices#closeAndCleanupAllData() cleans up the blob store, but it
does not clean up the HA storage dir.
*Proposal*
Clean up the HA storage dir after cleaning up the blob store in
AbstractHaServices#closeAndCleanupAllData().
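
The proposed ordering can be sketched roughly as follows. This is a minimal, standalone illustration on the local file system, not Flink's actual implementation: the class HaCleanupSketch, the helper deleteRecursively, and the assumed layout (a "blob" subdirectory inside the per-cluster HA storage dir) are hypothetical and only serve to show the blob store being cleaned first and the enclosing HA storage dir being removed afterwards.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for the cleanup path; not the real AbstractHaServices. */
final class HaCleanupSketch {

    /** Deletes dir and everything below it; does nothing if dir does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> ordered;
        try (Stream<Path> paths = Files.walk(dir)) {
            // Reverse order puts children before their parents, so every
            // directory is already empty when it is deleted.
            ordered = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : ordered) {
            Files.delete(p);
        }
    }

    /**
     * Mirrors the proposed order of operations: blob store first, then the
     * HA storage dir itself (including any leftover completedCheckpoints files).
     * The "blob" subdirectory name is an assumption made for this sketch.
     */
    static void closeAndCleanupAllData(Path haStorageDir) throws IOException {
        // Step 1: what the cleanup already does today (blob store cleanup).
        deleteRecursively(haStorageDir.resolve("blob"));
        // Step 2: the proposed addition, removing the now-empty cluster dir
        // along with any remaining files under it.
        deleteRecursively(haStorageDir);
    }
}
```

Calling HaCleanupSketch.closeAndCleanupAllData(clusterDir) on a per-cluster directory removes that directory and leaves its parent untouched. A real fix would go through Flink's FileSystem abstraction rather than java.nio.file, and would need to decide how to handle a failure in either step.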