junzhong qin created FLINK-38215:
------------------------------------

Summary: Remove HA data according to config option `job.savepoint-on-deletion` when cleanup terminal JM after ttl passed
Key: FLINK-38215
URL: https://issues.apache.org/jira/browse/FLINK-38215
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: junzhong qin
When a bounded job reaches the FINISHED state, its HA data is still kept after the JobManager (JM) deployment is deleted. As a result, the job stays in the reconcile loop. The following operator log snippet is from a FINISHED bounded streaming job:

{code:java}
2025-08-07 20:51:47,786 INFO org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler [] [bounded-job-test] - Removing JobManager deployment for terminal application.
2025-08-07 20:51:47,786 INFO org.apache.flink.kubernetes.operator.service.AbstractFlinkService [] [bounded-job-test] - Deleting cluster with Foreground propagation
2025-08-07 20:51:47,787 INFO org.apache.flink.kubernetes.operator.service.AbstractFlinkService [] [bounded-job-test] - Scaling JobManager Deployment to zero with 60 seconds timeout...
2025-08-07 20:51:49,124 INFO org.apache.flink.kubernetes.operator.service.AbstractFlinkService [] [bounded-job-test] - Completed Scaling JobManager Deployment to zero
2025-08-07 20:51:49,125 INFO org.apache.flink.kubernetes.operator.service.AbstractFlinkService [] [bounded-job-test] - Deleting JobManager Deployment with 298 seconds timeout...
2025-08-07 20:51:51,148 INFO org.apache.flink.kubernetes.operator.service.AbstractFlinkService [] [bounded-job-test] - Completed Deleting JobManager Deployment
2025-08-07 20:51:51,151 INFO org.apache.flink.kubernetes.operator.service.AbstractFlinkService [] [bounded-job-test] - Keeping HA metadata for last-state restore
2025-08-07 20:51:51,290 INFO org.apache.flink.kubernetes.operator.observer.deployment.ApplicationObserver [] [bounded-job-test] - Observing JobManager deployment. Previous status: MISSING
2025-08-07 20:51:51,296 INFO org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler [] [bounded-job-test] - Resource fully reconciled, nothing to do...
2025-08-07 20:52:06,315 INFO org.apache.flink.kubernetes.operator.observer.deployment.ApplicationObserver [] [bounded-job-test] - Observing JobManager deployment. Previous status: MISSING
2025-08-07 20:52:06,321 INFO org.apache.flink.kubernetes.operator.reconciler.deployment.AbstractFlinkResourceReconciler [] [bounded-job-test] - Resource fully reconciled, nothing to do...
{code}

The log shows that the HA metadata is kept for a last-state restore, which is not actually needed here, and the resource keeps being picked up by the reconcile loop. If I understand correctly, the HA metadata should be cleaned up according to the config option `job.savepoint-on-deletion` or the SuspendMode. I'd like to contribute a PR to fix this.
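The proposed cleanup rule could be expressed as a small decision helper. This is a minimal sketch, not the operator's actual code: `HaCleanupPolicy`, `shouldDeleteHaData`, and the `SuspendMode` values below are hypothetical names, and the sketch assumes one reading of the proposal, namely that HA metadata is only needed for a last-state restore unless `job.savepoint-on-deletion` is enabled.

{code:java}
/**
 * Hypothetical sketch: decides whether HA metadata can be deleted when the
 * operator removes the JobManager deployment of a terminal application.
 * HaCleanupPolicy and SuspendMode are illustrative names, not operator classes.
 */
public class HaCleanupPolicy {

    /** Illustrative suspend modes (assumed, not copied from the operator). */
    enum SuspendMode { STATELESS, SAVEPOINT, LAST_STATE }

    /**
     * Returns true if the HA metadata can be removed.
     *
     * @param savepointOnDeletion value of the `job.savepoint-on-deletion` option
     * @param suspendMode         assumed mode the job would be restored with
     */
    static boolean shouldDeleteHaData(boolean savepointOnDeletion, SuspendMode suspendMode) {
        if (savepointOnDeletion) {
            // Assumption: with savepoint-on-deletion enabled, a later restore
            // does not rely on HA metadata, so it can be removed.
            return true;
        }
        // Only a last-state restore reads the HA metadata; keep it in that case.
        return suspendMode != SuspendMode.LAST_STATE;
    }

    public static void main(String[] args) {
        // Print the decision for every combination of the two inputs.
        for (SuspendMode mode : SuspendMode.values()) {
            for (boolean savepointOnDeletion : new boolean[] {false, true}) {
                System.out.printf("savepoint-on-deletion=%b, mode=%s -> %s%n",
                        savepointOnDeletion, mode,
                        shouldDeleteHaData(savepointOnDeletion, mode)
                                ? "delete HA metadata" : "keep HA metadata");
            }
        }
    }
}
{code}

Under this reading, only the combination `savepoint-on-deletion=false` with `LAST_STATE` keeps the HA metadata; every other combination removes it, which would stop the "Keeping HA metadata for last-state restore" path for the FINISHED job above.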