Kartikey Pant created FLINK-37766:
-------------------------------------

             Summary: FlinkSessionJob deletion blocked by finalizer when Flink job already terminal/missing due to HA desync
                 Key: FLINK-37766
                 URL: https://issues.apache.org/jira/browse/FLINK-37766
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.20.1
         Environment: Flink Kubernetes Operator Image: apache/flink-kubernetes-operator:1.10.0
Flink Image: apache/flink:1.20.1
Kubernetes: minikube version: v1.35.0
            Reporter: Kartikey Pant


We've encountered an issue where {{FlinkSessionJob}} custom resources become stuck in the {{Terminating}} state when deleted via {{kubectl delete}}. This occurs after a desynchronization between the Flink Kubernetes Operator and the Flink JobManager, typically triggered by a JobManager restart in which the High Availability (HA) mechanism fails to recover the state of the pre-existing job.

The sequence of events leading to the problem is as follows:
# A Flink JobManager pod for an active session cluster restarts.
# On restart, the JobManager's HA recovery fails to load the state of previously running jobs. JobManager logs indicate this with messages like {{Retrieved job ids [] from KubernetesStateHandleStore...}}.
# This creates a desynchronization:
** The Flink Operator (via the {{FlinkSessionJob}} CR status) still holds the original Flink JobID and its last known state/savepoint, and keeps trying to reconcile that job.
** The newly started Flink JobManager has no internal record of this job instance from its HA recovery.
# The {{FlinkSessionJob}} CR status often remains {{RECONCILING}} while the Operator tries to manage a job that the current JobManager does not recognize from its HA state.
# When {{kubectl delete FlinkSessionJob <job-name>}} is issued, the Operator's finalizer logic ({{flinksessionjobs.flink.apache.org/finalizer}}) is triggered.
# The Operator attempts to cancel the Flink job via the JobManager's REST API, using the JobID from the CR status.
# The JobManager, which either does not know the job or has internally marked it as {{FAILED}} as a result of the ongoing reconciliation attempts for the desynchronized job, responds to the cancellation request with an error. JobManager logs show: {{Job cancellation failed because the job has already reached another terminal state (FAILED).}}
# The Operator's REST client logic (or the finalizer's error handling) does not gracefully handle this "already FAILED" (or potentially "not found") response; an exception occurs inside the Operator, visible in the Operator logs and typically involving {{RestClient.parseResponse}} or {{CompletableFuture.completeExceptionally}}.
# Because of this unhandled exception in the finalizer logic, the Operator fails to remove its finalizer from the {{FlinkSessionJob}} CR.
# Consequently, the {{FlinkSessionJob}} CR remains stuck in the {{Terminating}} state indefinitely.

The only workaround is to manually edit the {{FlinkSessionJob}} CR and remove the finalizer (e.g., with a {{kubectl patch}} as sketched below), allowing Kubernetes to complete the deletion.
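
For anyone hitting this, the manual workaround amounts to clearing the finalizer list on the stuck CR. A minimal sketch, where {{<job-name>}} and {{<namespace>}} are placeholders for the affected resource:

{code:bash}
# Clear all finalizers on the stuck FlinkSessionJob so Kubernetes can finish the deletion.
# <job-name> and <namespace> are placeholders for the affected resource.
kubectl patch flinksessionjob <job-name> -n <namespace> \
  --type=merge -p '{"metadata":{"finalizers":[]}}'
{code}

Note that this bypasses the Operator's cleanup entirely, so any HA ConfigMaps or other artifacts belonging to the desynchronized job may have to be removed manually afterwards.
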
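
For reference, the reproduction steps below assume an HA-enabled session cluster and session job roughly along these lines. This is only a sketch, not the exact manifests from our environment: the resource names, resource sizes, service account, HA storage path, and {{jarURI}} are placeholders/assumptions.

{code:yaml}
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: session-cluster            # placeholder name
spec:
  image: flink:1.20.1
  flinkVersion: v1_20
  flinkConfiguration:
    # Kubernetes HA keeps job metadata in ConfigMaps plus a storage directory.
    high-availability.type: kubernetes
    high-availability.storageDir: file:///opt/flink/ha   # assumption: any durable path reachable by the JobManager
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink            # assumes the RBAC setup from the operator quickstart
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
---
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: example-session-job        # placeholder name
spec:
  deploymentName: session-cluster  # must match the FlinkDeployment above
  job:
    jarURI: https://example.com/path/to/streaming-job.jar   # placeholder; any reachable example jar works
    parallelism: 2
    upgradeMode: stateless         # keeps the sketch minimal; savepoint mode additionally needs state.savepoints.dir
{code}

With such a setup running, clearing the HA ConfigMaps while the JobManager pod restarts (step 3 below) should reproduce the desynchronization described above.
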
*Steps to Reproduce:*
# Deploy a Flink session cluster with HA enabled (e.g., Kubernetes HA).
# Submit a {{FlinkSessionJob}} to the cluster.
# Induce a JobManager restart such that its HA metadata for the running job is lost or not recoverable (e.g., by temporarily clearing the HA storage such as the HA ConfigMaps before the JobManager fully recovers, or by simulating a crash in which the HA data is never written).
# The new JobManager starts without recovering the previous job.
# The {{FlinkSessionJob}} CR may show {{RECONCILING}} as the Operator tries to manage the desynchronized job.
# Attempt to delete the {{FlinkSessionJob}} CR with {{kubectl delete}}.
# Observe exceptions in the Operator logs during finalization, and the {{FlinkSessionJob}} CR getting stuck in the {{Terminating}} state.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)