lucasgameiroborges opened a new pull request, #1138:
URL: https://github.com/apache/flink-kubernetes-operator/pull/1138

   ## Motivation
   
   During an application upgrade, `ApplicationReconciler.deploy()` calls
   `flinkService.deleteClusterDeployment()` to tear down the old cluster,
   then immediately calls `flinkService.submitApplicationCluster()` to
   start the new one. If the old JobManager Deployment takes longer to
   disappear than `kubernetes-operator.flink.shutdown-cluster.timeout`,
   the operator can create a new Deployment while the old one is still
   running, causing port conflicts on the `-rest` Service or stale
   endpoint routing.
   
   ### Root cause
   
   `AbstractFlinkService.deleteBlocking()` waited on
   `waitUntilCondition(Objects::isNull, ...)`. When that wait timed out,
   the resulting `KubernetesClientTimeoutException` was caught by the
   generic `KubernetesClientException` handler, because
   `KubernetesClientTimeoutException extends KubernetesClientException`.
   Its `getCode()` returns `0`, not `404`, so the handler only logged a
   `WARN` and the method returned normally. The reconciler therefore had
   no way to detect that deletion had not actually completed.
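
The swallowed timeout can be reproduced in isolation. The following is a minimal sketch, not the operator's code: the nested exception classes are hypothetical stand-ins that mirror the hierarchy described above so the example compiles without the Kubernetes client, and `deleteBlockingBeforeFix` is an assumed, simplified model of the old handler.

```java
class SwallowedTimeoutDemo {
    // Hypothetical stand-ins for the client exceptions (assumption: only the
    // subclass relationship and getCode() matter for this illustration).
    static class KubernetesClientException extends RuntimeException {
        private final int code;

        KubernetesClientException(String message, int code) {
            super(message);
            this.code = code;
        }

        int getCode() {
            return code;
        }
    }

    static class KubernetesClientTimeoutException extends KubernetesClientException {
        KubernetesClientTimeoutException(String message) {
            super(message, 0); // a timeout carries no HTTP status, so the code is 0
        }
    }

    /** Simplified model of the pre-fix handler; returns a description of the outcome. */
    static String deleteBlockingBeforeFix(Runnable deleteAndWait) {
        try {
            deleteAndWait.run();
            return "deleted";
        } catch (KubernetesClientException e) {
            if (e.getCode() == 404) {
                return "already gone";
            }
            // The timeout subclass lands here as well: it is only reported,
            // and the method still returns normally to its caller.
            return "warned: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        String outcome = deleteBlockingBeforeFix(() -> {
            throw new KubernetesClientTimeoutException("Deployment still present");
        });
        System.out.println(outcome); // prints "warned: Deployment still present"
    }
}
```

Because the method returns normally, a caller cannot tell this outcome apart from a completed deletion.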
   
   ## Changes
   
   - **`AbstractFlinkService.deleteBlocking()`** — add an explicit
     `catch (KubernetesClientTimeoutException)` block that re-throws the
     exception, placed *before* the generic `KubernetesClientException`
     handler. The reconciler then receives the exception and requeues
     instead of proceeding to `submitApplicationCluster()`.
   
   - **`NativeFlinkService.shutdownJobManagersBlocking()`** — wrap the
     `deleteBlocking()` call for the optional graceful scale-to-zero step
     in its own `try/catch (KubernetesClientTimeoutException)`. That step
     is best-effort by design (its own lambda already suppresses patch
     errors); a timeout there should not prevent the actual Deployment
     deletion from proceeding.
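
The shape of the two changes above can be sketched as follows. This is an illustrative model under stated assumptions, not the actual diff: the nested exception classes are hypothetical stand-ins for the client types, the method signatures are simplified, and the `Runnable` parameters stand in for the real Kubernetes calls.

```java
import java.util.ArrayList;
import java.util.List;

class RethrowOnTimeoutDemo {
    // Hypothetical stand-ins for the client exceptions.
    static class KubernetesClientException extends RuntimeException {
        private final int code;

        KubernetesClientException(String message, int code) {
            super(message);
            this.code = code;
        }

        int getCode() {
            return code;
        }
    }

    static class KubernetesClientTimeoutException extends KubernetesClientException {
        KubernetesClientTimeoutException(String message) {
            super(message, 0);
        }
    }

    /** Post-fix shape: the timeout is re-thrown before the generic handler can see it. */
    static void deleteBlocking(String operation, Runnable deleteAndWait) {
        try {
            deleteAndWait.run();
        } catch (KubernetesClientTimeoutException e) {
            throw e; // deletion is not confirmed, so the caller must find out
        } catch (KubernetesClientException e) {
            if (e.getCode() != 404) {
                System.out.println("WARN: error while " + operation + ": " + e.getMessage());
            }
        }
    }

    /** Best-effort scale-to-zero followed by the mandatory Deployment deletion. */
    static List<String> shutdownJobManagersBlocking(Runnable scaleToZero, Runnable deleteDeployment) {
        List<String> steps = new ArrayList<>();
        try {
            deleteBlocking("scaling to zero", scaleToZero);
            steps.add("scaled to zero");
        } catch (KubernetesClientTimeoutException e) {
            // The optional step timed out; carry on with the real deletion.
            steps.add("scale-to-zero timed out, continuing");
        }
        deleteBlocking("deleting Deployment", deleteDeployment); // a timeout here propagates
        steps.add("deployment deleted");
        return steps;
    }

    public static void main(String[] args) {
        List<String> steps = shutdownJobManagersBlocking(
                () -> { throw new KubernetesClientTimeoutException("pods still terminating"); },
                () -> { });
        System.out.println(steps);
    }
}
```

In this sketch only a timeout in the optional step is absorbed. A timeout while deleting the Deployment itself escapes `shutdownJobManagersBlocking`, which is what lets the reconciler requeue instead of submitting the new cluster.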
   
   ## Testing
   
   Existing unit tests in `AbstractFlinkServiceTest` cover the
   `deleteBlocking` error-handling paths. A test for the timeout case
   (simulating `KubernetesClientTimeoutException` from
   `waitUntilCondition`) can be added to verify the new re-throw
   behaviour.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
