[PR] [FLINK-39618] FlinkDeployment deletion deadlocks when FlinkSessionJobs are running with default block-on-* options [flink-kubernetes-operator]

via GitHub Wed, 06 May 2026 11:43:13 -0700


Dennis-Mircea opened a new pull request, #1110:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1110


   ## What is the purpose of the change
   
   Fix a deletion deadlock between `SessionJobReconciler` and 
`SessionReconciler` introduced in 1.15 by [FLINK-39271]. When a 
`FlinkDeployment` in session mode is deleted while jobs are running, the 
operator can get permanently stuck: `SessionJobReconciler` skips cancelling the 
Flink job (assuming the cluster will tear down anyway), but `SessionReconciler` 
then refuses to remove the `FlinkDeployment` finalizer because 
`BLOCK_ON_UNMANAGED_JOBS` finds the still-running jobs on the JobManager. Both 
options default to `true`, so the deadlock occurs out of the box.
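
   The circular wait can be sketched as a toy model. The class, method, and parameter names below are hypothetical and do not mirror the operator's code. Only the two conditions described above are modelled: whether the session job cleanup cancels the Flink job, and whether the session cluster cleanup holds the finalizer while jobs are still running on the JobManager.

```java
// Toy model of the deletion deadlock; all names are illustrative, not operator APIs.
class DeletionDeadlockSketch {

    /**
     * Simulates reconcile rounds for a session cluster that is being deleted.
     *
     * @param skipCancelOnClusterDeletion models the session job cleanup bypass
     * @param blockOnUnmanagedJobs models BLOCK_ON_UNMANAGED_JOBS
     * @param maxRounds number of reconcile rounds to simulate
     * @return the round in which the FlinkDeployment finalizer is released, or -1 if never
     */
    static int roundsUntilFinalizerReleased(
            boolean skipCancelOnClusterDeletion, boolean blockOnUnmanagedJobs, int maxRounds) {
        int runningJobs = 1; // assumption: one session job is running when deletion starts
        for (int round = 1; round <= maxRounds; round++) {
            // Session job cleanup: either cancel the Flink job or leave it running.
            if (!skipCancelOnClusterDeletion && runningJobs > 0) {
                runningJobs--;
            }
            // Session cluster cleanup: hold the finalizer while jobs are still running.
            boolean blocked = blockOnUnmanagedJobs && runningJobs > 0;
            if (!blocked) {
                return round;
            }
        }
        return -1; // finalizer was never released within maxRounds
    }

    public static void main(String[] args) {
        System.out.println("bypass + block: " + roundsUntilFinalizerReleased(true, true, 10));
        System.out.println("cancel + block: " + roundsUntilFinalizerReleased(false, true, 10));
    }
}
```

   In this model the finalizer is never released when the bypass and the block are both active. It is released as soon as either the job is cancelled or the block is disabled.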
   
   ## Brief change log
   
     - In `SessionJobReconciler.cleanupInternal`, gate the "skip cancellation 
when the session cluster is being deleted" bypass on `BLOCK_ON_SESSION_JOBS`. 
When that option is enabled (default), fall through to the regular cancel path 
so the Flink job is actually stopped, allowing `SessionReconciler` to release 
the finalizer and complete the cluster teardown.
     - Read `observeConfig` lazily inside the `DELETING`/`DELETED` branch (it 
can be `null` when the JobManager is unreachable, e.g. for unhealthy clusters); 
when `null`, take the bypass since no JM-side guard can fire anyway.
     - Move the existing `observeConfig` access inside the `try` block back to 
a `ctx.getObserveConfig()` call, and replace the deprecated 
`Configuration#getBoolean(ConfigOption)` calls with 
`Configuration#get(ConfigOption)`.
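
   The three paths through the new gate can be summarised in a minimal sketch. This is not the code in `SessionJobReconciler.cleanupInternal`: the enums, the `decide` method, and the `Supplier` standing in for the lazy `observeConfig` read are assumptions made for illustration, with a `null` result modelling `observeConfig == null`.

```java
import java.util.function.Supplier;

// Sketch of the cleanup gate; the enums and method are hypothetical stand-ins.
class CleanupGateSketch {
    enum ClusterLifecycle { STABLE, DELETING, DELETED }

    enum CleanupAction { SKIP_CANCEL, CANCEL_JOB }

    /**
     * @param cluster lifecycle state of the session cluster
     * @param blockOnSessionJobs lazily yields BLOCK_ON_SESSION_JOBS, or null when the
     *     observe config is unavailable (JobManager unreachable)
     */
    static CleanupAction decide(ClusterLifecycle cluster, Supplier<Boolean> blockOnSessionJobs) {
        boolean clusterGoingAway =
                cluster == ClusterLifecycle.DELETING || cluster == ClusterLifecycle.DELETED;
        if (clusterGoingAway) {
            // Read the option only inside this branch, mirroring the lazy read.
            Boolean block = blockOnSessionJobs.get();
            if (block == null || !block) {
                // No JM-side guard can fire, or blocking is disabled: keep the bypass.
                return CleanupAction.SKIP_CANCEL;
            }
        }
        // Regular path: cancel the job so the cluster finalizer can be released later.
        return CleanupAction.CANCEL_JOB;
    }

    public static void main(String[] args) {
        System.out.println(decide(ClusterLifecycle.DELETING, () -> true));
        System.out.println(decide(ClusterLifecycle.DELETING, () -> false));
        System.out.println(decide(ClusterLifecycle.DELETED, () -> null));
    }
}
```

   Passing a supplier rather than a value keeps the sketch faithful to the lazy read: outside the `DELETING`/`DELETED` branch the configuration is never consulted.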
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
     - Added 
`SessionJobReconcilerTest#testCleanupWithDeletingClusterBlockOnSessionJobsDisabled`
 (parameterized over `DELETING` and `DELETED`): asserts the bypass still fires 
when `BLOCK_ON_SESSION_JOBS=false`, the finalizer is removed, and the job is 
left untouched.
     - Added 
`SessionJobReconcilerTest#testCleanupWithDeletingClusterBlockOnSessionJobsEnabled`
 (parameterized over `DELETING` and `DELETED`): asserts that with 
`BLOCK_ON_SESSION_JOBS=true` the cancel path is taken, the cluster-side job 
transitions to `CANCELED`, and the reconciler reschedules with the finalizer 
held until cancellation is re-observed — the exact path that breaks the 
deadlock.
     - Added `TestUtils#createContextWithReadyFlinkDeploymentInLifecycleState` 
so tests can produce a context with a `READY` JobManager in a given lifecycle 
state with a custom flink configuration; the pre-existing 
`createContextWithFlinkDeploymentInLifecycleState` always sets the JM to 
`MISSING`, which short-circuits the new gate via `observeConfig == null` and so 
cannot exercise it.
     - Existing `testCleanupWithDeletingSessionCluster`, 
`testCleanupWithDeletedSessionCluster`, and 
`testCleanupWithUnhealthySessionClusterNoHa` continue to pass — they cover the 
third path of the new gate (`observeConfig == null` when the JM is `MISSING`).
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., are there any changes to the `CustomResourceDescriptors`: no
     - Core observer or reconciler logic that is regularly executed: yes
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

