lucasgameiroborges opened a new pull request, #1138:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1138
## Motivation
During an application upgrade, `ApplicationReconciler.deploy()` calls
`flinkService.deleteClusterDeployment()` to tear down the old cluster,
then immediately calls `flinkService.submitApplicationCluster()` to
start the new one. If the old JobManager Deployment takes longer to
disappear than `kubernetes-operator.flink.shutdown-cluster.timeout`,
the operator can create a new Deployment while the old one is still
running, causing port conflicts on the `-rest` Service or stale
endpoint routing.
### Root cause
`AbstractFlinkService.deleteBlocking()` waited on
`waitUntilCondition(Objects::isNull, ...)` and caught the resulting
`KubernetesClientTimeoutException` in the generic
`KubernetesClientException` handler (because
`KubernetesClientTimeoutException extends KubernetesClientException`
and `getCode()` returns `0`, not `404`). The method only logged a
`WARN` and returned normally, so the reconciler had no way to detect
that deletion had not actually completed.
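
A minimal sketch of why the timeout was swallowed. The two exception classes are hypothetical stand-ins for the fabric8 types named above, and `deleteBlockingOld` only approximates the previous handler shape (the `404` check is an assumption inferred from the description, not copied from the operator source):

```java
import java.util.ArrayList;
import java.util.List;

final class SwallowedTimeoutDemo {

    // Hypothetical stand-in for the fabric8 base exception.
    static class KubernetesClientException extends RuntimeException {
        private final int code;

        KubernetesClientException(String message, int code) {
            super(message);
            this.code = code;
        }

        int getCode() {
            return code;
        }
    }

    // Hypothetical stand-in: a timeout is a subclass whose code is 0.
    static class KubernetesClientTimeoutException extends KubernetesClientException {
        KubernetesClientTimeoutException(String message) {
            super(message, 0);
        }
    }

    // Collects log lines so the behaviour is observable.
    static final List<String> LOG = new ArrayList<>();

    // Assumed shape of the old handler: every KubernetesClientException,
    // including the timeout subclass, ends up in the same catch block.
    static void deleteBlockingOld(Runnable deleteAndWait) {
        try {
            deleteAndWait.run();
        } catch (KubernetesClientException e) {
            if (e.getCode() != 404) {
                // Code 0 lands here: logged, then the method returns normally.
                LOG.add("WARN: " + e.getMessage());
            }
        }
    }

    public static void main(String[] args) {
        deleteBlockingOld(() -> {
            throw new KubernetesClientTimeoutException("timed out waiting for deletion");
        });
        // No exception reached the caller, so it cannot tell deletion failed.
        System.out.println("returned normally, log = " + LOG);
    }
}
```

From the caller's side, the timed-out deletion is indistinguishable from a successful one.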
## Changes
- **`AbstractFlinkService.deleteBlocking()`** — add an explicit
`catch (KubernetesClientTimeoutException)` that re-throws the exception,
placed *before* the generic `KubernetesClientException` handler.
The reconciler then receives the exception and requeues instead
of proceeding to `submitApplicationCluster()`.
- **`NativeFlinkService.shutdownJobManagersBlocking()`** — wrap the
`deleteBlocking()` call for the optional graceful scale-to-zero step
in its own `try/catch (KubernetesClientTimeoutException)`. That step
is best-effort by design (its own lambda already suppresses patch
errors); a timeout there should not prevent the actual Deployment
deletion from proceeding.
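
The two changes could look roughly like the sketch below. It is not the actual diff: the exception classes are hypothetical stand-ins for the fabric8 types, and the method signatures (a `Runnable` per delete step) are simplified assumptions made so the example runs on its own.

```java
final class RethrowTimeoutSketch {

    // Hypothetical stand-ins for the fabric8 exception types.
    static class KubernetesClientException extends RuntimeException {
        private final int code;

        KubernetesClientException(String message, int code) {
            super(message);
            this.code = code;
        }

        int getCode() {
            return code;
        }
    }

    static class KubernetesClientTimeoutException extends KubernetesClientException {
        KubernetesClientTimeoutException(String message) {
            super(message, 0);
        }
    }

    // Simplified deleteBlocking(): the timeout is caught first and re-thrown.
    static void deleteBlocking(String operation, Runnable deleteAndWait) {
        try {
            deleteAndWait.run();
        } catch (KubernetesClientTimeoutException e) {
            // The subclass handler must precede the parent handler;
            // javac rejects the reverse order as an already-caught exception.
            throw e;
        } catch (KubernetesClientException e) {
            if (e.getCode() != 404) {
                System.out.println("WARN: " + operation + " failed: " + e.getMessage());
            }
        }
    }

    // Simplified shutdownJobManagersBlocking(): only the optional
    // scale-to-zero step tolerates a timeout.
    static void shutdownJobManagersBlocking(Runnable scaleToZero, Runnable deleteDeployment) {
        try {
            deleteBlocking("Scaling JobManager to zero", scaleToZero);
        } catch (KubernetesClientTimeoutException e) {
            System.out.println("WARN: best-effort scale-down timed out, continuing");
        }
        // A timeout here propagates, so the caller can requeue.
        deleteBlocking("Deleting JobManager Deployment", deleteDeployment);
    }

    public static void main(String[] args) {
        Runnable ok = () -> { };
        Runnable timeout = () -> {
            throw new KubernetesClientTimeoutException("timed out");
        };

        // A scale-down timeout is tolerated; deletion still runs.
        shutdownJobManagersBlocking(timeout, ok);

        // A deletion timeout reaches the caller.
        try {
            shutdownJobManagersBlocking(ok, timeout);
        } catch (KubernetesClientTimeoutException e) {
            System.out.println("propagated: " + e.getMessage());
        }
    }
}
```

Keeping the tolerance local to the scale-to-zero call means every other `deleteBlocking()` caller gets the stricter behaviour by default.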
## Testing
Existing unit tests in `AbstractFlinkServiceTest` cover the
`deleteBlocking` error-handling paths. A test for the timeout case
(simulating `KubernetesClientTimeoutException` from
`waitUntilCondition`) can be added to verify the new re-throw
behaviour.
🤖 Generated with [Claude Code](https://claude.com/claude-code)