Dennis-Mircea opened a new pull request, #1110:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1110
## What is the purpose of the change
Fix a deletion deadlock between `SessionJobReconciler` and
`SessionReconciler` introduced in 1.15 by [FLINK-39271]. When a
`FlinkDeployment` in session mode is deleted while jobs are running, the
operator can get permanently stuck: `SessionJobReconciler` skips cancelling the
Flink job (assuming the cluster will tear down anyway), but `SessionReconciler`
then refuses to remove the `FlinkDeployment` finalizer because
`BLOCK_ON_UNMANAGED_JOBS` finds the still-running jobs on the JobManager. Both
`BLOCK_ON_UNMANAGED_JOBS` and `BLOCK_ON_SESSION_JOBS` default to `true`, so the
deadlock occurs out of the box.
## Brief change log
- In `SessionJobReconciler.cleanupInternal`, gate the "skip cancellation
when the session cluster is being deleted" bypass on `BLOCK_ON_SESSION_JOBS`.
When that option is enabled (default), fall through to the regular cancel path
so the Flink job is actually stopped, allowing `SessionReconciler` to release
the finalizer and complete the cluster teardown.
- Read `observeConfig` lazily inside the `DELETING`/`DELETED` branch (it
can be `null` when the JobManager is unreachable, e.g. for unhealthy clusters);
when `null`, take the bypass since no JM-side guard can fire anyway.
- Move the existing `observeConfig` access inside the `try` block back to
a `ctx.getObserveConfig()` call, and replace the deprecated
`Configuration#getBoolean(ConfigOption)` calls with
`Configuration#get(ConfigOption)`.
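
The gate described in the change log can be sketched as a small decision function. This is a simplified model for reviewers, not the code in `SessionJobReconciler.cleanupInternal`: the class `CleanupGateSketch`, its enums, and the `decide` method are hypothetical names, and the nullable `observeConfig` is represented as an `Optional<Boolean>` that holds the `BLOCK_ON_SESSION_JOBS` value when the config is available. The sketch also assumes that a cluster which is not being deleted always takes the regular cancel path.

```java
import java.util.Optional;

// Simplified model of the cleanup gate. None of these types are the operator's real classes.
final class CleanupGateSketch {

    // Hypothetical stand-in for the lifecycle state of the owning session FlinkDeployment.
    enum ClusterLifecycle { STABLE, DELETING, DELETED }

    // Outcome of the gate: bypass cancellation, or fall through to the regular cancel path.
    enum Action { SKIP_CANCEL, CANCEL_JOB }

    /**
     * @param lifecycle lifecycle state of the session cluster
     * @param blockOnSessionJobs value of BLOCK_ON_SESSION_JOBS from the observe config,
     *     or empty when the observe config is null (JobManager unreachable)
     */
    static Action decide(ClusterLifecycle lifecycle, Optional<Boolean> blockOnSessionJobs) {
        boolean clusterGoingAway =
                lifecycle == ClusterLifecycle.DELETING || lifecycle == ClusterLifecycle.DELETED;
        if (!clusterGoingAway) {
            // Assumption: outside the DELETING/DELETED branch the job is cancelled as usual.
            return Action.CANCEL_JOB;
        }
        if (!blockOnSessionJobs.isPresent()) {
            // No observe config: no JobManager-side guard can fire, so take the bypass.
            return Action.SKIP_CANCEL;
        }
        // Option enabled: cancel so the cluster finalizer can be released later.
        // Option disabled: keep the bypass.
        return blockOnSessionJobs.get() ? Action.CANCEL_JOB : Action.SKIP_CANCEL;
    }

    public static void main(String[] args) {
        System.out.println(decide(ClusterLifecycle.DELETING, Optional.of(true)));
        System.out.println(decide(ClusterLifecycle.DELETED, Optional.of(false)));
        System.out.println(decide(ClusterLifecycle.DELETING, Optional.empty()));
    }
}
```

The three calls in `main` correspond to the three paths of the gate: option enabled, option disabled, and `observeConfig == null`.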
## Verifying this change
This change added tests and can be verified as follows:
- Added
`SessionJobReconcilerTest#testCleanupWithDeletingClusterBlockOnSessionJobsDisabled`
(parameterized over `DELETING` and `DELETED`): asserts the bypass still fires
when `BLOCK_ON_SESSION_JOBS=false`, the finalizer is removed, and the job is
left untouched.
- Added
`SessionJobReconcilerTest#testCleanupWithDeletingClusterBlockOnSessionJobsEnabled`
(parameterized over `DELETING` and `DELETED`): asserts that with
`BLOCK_ON_SESSION_JOBS=true` the cancel path is taken, the cluster-side job
transitions to `CANCELED`, and the reconciler reschedules with the finalizer
held until cancellation is re-observed — the exact path that breaks the
deadlock.
- Added `TestUtils#createContextWithReadyFlinkDeploymentInLifecycleState`
so tests can produce a context with a `READY` JobManager in a given lifecycle
state with a custom flink configuration; the pre-existing
`createContextWithFlinkDeploymentInLifecycleState` always sets the JM to
`MISSING`, which short-circuits the new gate via `observeConfig == null` and so
cannot exercise it.
- Existing `testCleanupWithDeletingSessionCluster`,
`testCleanupWithDeletedSessionCluster`, and
`testCleanupWithUnhealthySessionClusterNoHa` continue to pass — they cover the
third path of the new gate (`observeConfig == null` when the JM is `MISSING`).
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., are there any changes to the `CustomResourceDescriptors`:
no
- Core observer or reconciler logic that is regularly executed: yes
## Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? not applicable