[jira] [Updated] (FLINK-30513) HA storage dir leaks on cluster termination

Zhanghao Chen (Jira) Tue, 27 Dec 2022 05:44:06 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-30513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Zhanghao Chen updated FLINK-30513:
----------------------------------
    Description: 
*Problem*

We found that the HA storage dir leaks on cluster termination for a Flink job 
with HA enabled. The following picture shows the HA storage dir (here on HDFS) 
of the cluster czh-flink-test-offline (running in application mode) after 
cancelling the job with flink-cancel. We are left with an empty dir, and too 
many empty dirs will greatly hurt the stability of the HDFS NameNode!

!image-2022-12-27-21-32-17-510.png|width=582,height=158!

 

Furthermore, in case the user chooses to retain the checkpoints on job 
termination, the completedCheckpoints files will be leaked as well. Note that 
we no longer need the completedCheckpoints files as we'll directly recover 
retained CPs from the CP data dir.

*Root Cause*

When we run AbstractHaServices#closeAndCleanupAllData(), we clean up the blob 
store, but don't clean up the HA storage dir.

*Proposal*

Clean up the HA storage dir after cleaning up blob store in 
AbstractHaServices#closeAndCleanupAllData().
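
A minimal sketch of the proposed ordering, for illustration only. This is not 
Flink's actual code: the class HaCleanupSketch, its fields, and the "blob" 
subdirectory name are hypothetical, and the local filesystem (java.nio.file) 
stands in for HDFS. It only shows the idea of removing the HA storage dir 
right after the blob store cleanup, while still reporting a failure from 
either step.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical stand-in for an HA services class; not Flink's implementation. */
class HaCleanupSketch {
    // Assumption: the cluster-specific HA storage dir contains the blob store dir.
    private final Path haStorageDir;
    private final Path blobStoreDir;

    HaCleanupSketch(Path haStorageDir) {
        this.haStorageDir = haStorageDir;
        this.blobStoreDir = haStorageDir.resolve("blob");
    }

    /** Cleans the blob store first, then the HA storage dir itself. */
    void closeAndCleanupAllData() throws IOException {
        IOException failure = null;

        // Step 1: clean up the blob store (already done today, per the report).
        try {
            deleteRecursively(blobStoreDir);
        } catch (IOException e) {
            failure = e;
        }

        // Step 2 (the proposal): also remove the HA storage dir, so neither an
        // empty dir nor leftover completedCheckpoints files stay behind.
        try {
            deleteRecursively(haStorageDir);
        } catch (IOException e) {
            if (failure == null) {
                failure = e;
            } else {
                failure.addSuppressed(e);
            }
        }

        if (failure != null) {
            throw failure;
        }
    }

    /** Deletes a directory tree; does nothing if the path does not exist. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Running the second step even when the first one fails is a design choice of 
this sketch: a blob store error should not cause the HA storage dir to leak.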

  was:
*Problem*

We found that the HA storage dir leaks on cluster termination for a Flink job 
with HA enabled. The following picture shows the HA storage dir (here on HDFS) 
of the cluster czh-flink-test-offline (running in application mode) after 
cancelling the job with flink-cancel. We are left with an empty dir, and too 
many empty dirs will greatly hurt the stability of the HDFS NameNode!
!image-2022-12-27-21-32-17-510.png|width=582,height=158!

Furthermore, in case the user chooses to retain the checkpoints on job 
termination, the completedCheckpoints files will be leaked as well. Note that 
we no longer need the completedCheckpoints files as we'll directly recover 
retained CPs from the CP data dir.

*Root Cause*

When we run AbstractHaServices#closeAndCleanupAllData(), we clean up the blob 
store, but don't clean up the HA storage dir.

*Proposal*

Clean up the HA storage dir after cleaning up blob store in 
AbstractHaServices#closeAndCleanupAllData().


> HA storage dir leaks on cluster termination 
> --------------------------------------------
>
>                 Key: FLINK-30513
>                 URL: https://issues.apache.org/jira/browse/FLINK-30513
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0, 1.16.0
>            Reporter: Zhanghao Chen
>            Priority: Major
>         Attachments: image-2022-12-27-21-32-17-510.png
>
>
> *Problem*
> We found that the HA storage dir leaks on cluster termination for a Flink 
> job with HA enabled. The following picture shows the HA storage dir (here on 
> HDFS) of the cluster czh-flink-test-offline (running in application mode) 
> after cancelling the job with flink-cancel. We are left with an empty dir, 
> and too many empty dirs will greatly hurt the stability of the HDFS NameNode!
> !image-2022-12-27-21-32-17-510.png|width=582,height=158!
>  
> Furthermore, in case the user chooses to retain the checkpoints on job 
> termination, the completedCheckpoints files will be leaked as well. Note 
> that we no longer need the completedCheckpoints files as we'll directly 
> recover retained CPs from the CP data dir.
> *Root Cause*
> When we run AbstractHaServices#closeAndCleanupAllData(), we clean up the 
> blob store, but don't clean up the HA storage dir.
> *Proposal*
> Clean up the HA storage dir after cleaning up blob store in 
> AbstractHaServices#closeAndCleanupAllData().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
