Kartikey Pant created FLINK-37766:
-------------------------------------

             Summary: FlinkSessionJob deletion blocked by finalizer when Flink 
job already terminal/missing due to HA desync
                 Key: FLINK-37766
                 URL: https://issues.apache.org/jira/browse/FLINK-37766
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.20.1
         Environment: Flink Kubernetes Operator Image: 
apache/flink-kubernetes-operator:1.10.0

Flink Image: apache/flink:1.20.1

Kubernetes: minikube version: v1.35.0
            Reporter: Kartikey Pant


We've encountered an issue where {{FlinkSessionJob}} custom resources become 
stuck in a {{Terminating}} state when deleted via {{kubectl delete}}. This 
occurs after a desynchronization between the Flink Kubernetes Operator and the 
Flink JobManager, typically initiated by a JobManager restart where its High 
Availability (HA) mechanism fails to recover the state of the pre-existing job.

The sequence of events leading to the problem is as follows:
 # A Flink JobManager pod for an active session cluster restarts.
 # Upon restart, the JobManager's HA recovery fails to load the state of 
previously running jobs. JobManager logs indicate this with messages like: 
{{Retrieved job ids [] from KubernetesStateHandleStore...}}.
 # This creates a desynchronization:
 ** The Flink Operator (via the {{FlinkSessionJob}} CR status) still holds 
information about the original Flink JobID and its last known state/savepoint. 
It attempts to reconcile this job.
 ** The newly started Flink JobManager has no internal record of this specific 
job instance from its HA recovery.
 # The {{FlinkSessionJob}} CR status often remains {{RECONCILING}} as the 
Operator tries to manage a job the current JobManager doesn't recognize from 
its HA state.
 # When {{kubectl delete FlinkSessionJob <job-name>}} is issued, the Operator's 
finalizer ({{flinksessionjobs.flink.apache.org/finalizer}}) logic is 
triggered.
 # The Operator attempts to cancel the Flink job via the JobManager's REST API 
using the JobID from the CR status.
 # The Flink JobManager, which either doesn't know the job or has internally 
marked it as {{FAILED}} due to the ongoing reconciliation attempts for a 
desynchronized job, responds with an error to the cancellation request. 
JobManager logs show: {{Job cancellation failed because the job has already 
reached another terminal state (FAILED).}}
 # The Flink Kubernetes Operator's REST client logic or the finalizer's error 
handling does not gracefully process this specific "already FAILED" (or 
potentially "not found") response. An exception occurs within the Operator 
(visible in Operator logs, often involving {{RestClient.parseResponse}} or 
{{CompletableFuture.completeExceptionally}}).
 # Due to this unhandled exception in the finalizer logic, the Operator fails 
to remove its finalizer from the {{FlinkSessionJob}} CR.
 # Consequently, the {{FlinkSessionJob}} CR remains stuck in the 
{{Terminating}} state indefinitely.
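
The behaviour the finalizer would need in steps 7 to 9 can be modelled as a small decision function. The operator itself is written in Java; the Python below is only a sketch of the logic, not the operator's code. {{CancelResponse}} and {{safe_to_remove_finalizer}} are hypothetical names, and it is an assumption that an unknown job is answered with HTTP 404 and that the "already terminal" case can be recognized by the message text quoted above.

```python
from dataclasses import dataclass


@dataclass
class CancelResponse:
    """Hypothetical, simplified view of the JobManager's reply to a cancel request."""
    http_status: int
    message: str = ""


def safe_to_remove_finalizer(resp: CancelResponse) -> bool:
    """Return True when, from the JobManager's point of view, nothing is left to cancel."""
    if 200 <= resp.http_status < 300:
        return True  # cancellation request was accepted
    if resp.http_status == 404:
        return True  # assumption: JobManager has no record of the job (HA state lost)
    # Assumption: the "already terminal" case is identified by its message text,
    # matching the JobManager log line quoted in this report.
    if "already reached another terminal state" in resp.message:
        return True
    return False  # any other error: keep the finalizer and retry later
```

With this kind of classification, only genuinely unexpected errors would keep the finalizer in place; the "already FAILED" and "not found" replies would let the deletion complete.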

The only workaround is to manually edit the {{FlinkSessionJob}} CR and remove 
the finalizer, allowing Kubernetes to complete the deletion.
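
A minimal sketch of that workaround, assuming {{kubectl}} access to the cluster: it builds a JSON merge patch that drops only the operator's finalizer and keeps any others. The helper names {{finalizer_removal_patch}} and {{kubectl_patch_args}} are hypothetical, and the resource name {{flinksessionjobs.flink.apache.org}} is inferred from the finalizer name rather than stated in this report.

```python
import json
from typing import List

# Finalizer name as quoted in this report.
OPERATOR_FINALIZER = "flinksessionjobs.flink.apache.org/finalizer"


def finalizer_removal_patch(current_finalizers: List[str]) -> str:
    """Build a JSON merge patch that keeps every finalizer except the operator's."""
    remaining = [f for f in current_finalizers if f != OPERATOR_FINALIZER]
    # A merge patch replaces the whole list, so the remaining entries are sent back.
    return json.dumps({"metadata": {"finalizers": remaining}})


def kubectl_patch_args(job_name: str, namespace: str, patch: str) -> List[str]:
    """Assemble the kubectl invocation; the resource name is an assumption (see above)."""
    return [
        "kubectl", "patch", "flinksessionjobs.flink.apache.org", job_name,
        "-n", namespace, "--type=merge", "-p", patch,
    ]
```

The argument list can be handed to {{subprocess.run}} or printed and run by hand. The current finalizer list would first have to be read from the CR's {{metadata.finalizers}} field.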

 

*Steps to Reproduce:*
 # Deploy a Flink Session Cluster with HA enabled (e.g., Kubernetes HA).
 # Submit a {{FlinkSessionJob}} to the cluster.
 # Induce a JobManager restart in such a way that its HA metadata for the 
running job is lost or not recoverable (e.g., by temporarily clearing the HA 
storage like ConfigMaps before the JobManager fully recovers, or simulating a 
crash where HA data isn't written).
 # Verify that the new JobManager starts without recovering the previous job.
 # The {{FlinkSessionJob}} CR may show {{RECONCILING}} as the Operator tries to 
manage the desynchronized job.
 # Attempt to delete the {{FlinkSessionJob}} CR using {{kubectl delete}}.
 # Observe the Operator logs for exceptions during finalization and the 
{{FlinkSessionJob}} CR getting stuck in the {{Terminating}} state.



