Lucas Borges created FLINK-39953:
------------------------------------

             Summary: AbstractFlinkService.deleteBlocking() swallows 
KubernetesClientTimeoutException allowing cluster upgrade to proceed on top of 
still-running cluster
                 Key: FLINK-39953
                 URL: https://issues.apache.org/jira/browse/FLINK-39953
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: kubernetes-operator-1.15.0
         Environment: Operator 1.12.1
Flink 1.20.3
            Reporter: Lucas Borges


When a `FlinkDeployment` upgrade is triggered, the operator calls 
`deleteClusterDeployment()` before deploying the new cluster. Internally, 
`deleteClusterInternal()` calls `deleteDeploymentBlocking()` → 
`deleteBlocking()`, which issues the Kubernetes DELETE and then calls 
`waitUntilCondition(Objects::isNull, timeout, MILLISECONDS)` on the fabric8 
watch.

If the old cluster does not disappear within the configured 
`kubernetes.operator.cluster.shutdown-timeout`, the fabric8 client throws 
`KubernetesClientTimeoutException`. Because `KubernetesClientTimeoutException 
extends KubernetesClientException` and overrides `getCode()` to return `0` (no 
HTTP status code), it falls into the existing generic catch block:

```
} catch (KubernetesClientException kce) {
    // We completely ignore not found errors and simply log others
    if (kce.getCode() != HttpURLConnection.HTTP_NOT_FOUND) {
        LOG.warn("Error while " + operation, kce);
    }
}
```

The condition `0 != 404` is true, so only a WARN is emitted and the exception 
is **silently discarded**. `updateStatusAfterClusterDeletion()` runs, marking 
the cluster as deleted in the status, and the operator immediately proceeds to 
call `submitApplicationCluster()`, deploying the new version on top of a 
still-running cluster.

### Code Path

```
ApplicationReconciler.deploy()
  └─ deleteClusterDeployment()          ← calls deleteClusterInternal() + 
updateStatusAfterClusterDeletion()
       └─ deleteClusterInternal()
            └─ deleteDeploymentBlocking()
                 └─ deleteBlocking()
                      └─ waitUntilCondition(Objects::isNull, timeout)
                           → throws KubernetesClientTimeoutException
                           → caught by KubernetesClientException handler   ← 
BUG: exception swallowed
  └─ submitApplicationCluster()         ← called unconditionally, runs against 
still-alive cluster
```

### Operator Log Evidence

```
WARN  AbstractFlinkService - Error while deleting JobManager Deployment
      io.fabric8.kubernetes.client.KubernetesClientTimeoutException: Timed out 
waiting for ...
INFO  NativeFlinkService - Deploying application cluster    ← should NOT appear 
after timeout
```

### Proposed Fix

Add an explicit `catch (KubernetesClientTimeoutException)` block **before** the 
generic `KubernetesClientException`
handler in `AbstractFlinkService.deleteBlocking()` and re-throw it:

```java
} catch (KubernetesClientTimeoutException e) {
    throw e;
} catch (KubernetesClientException kce) {
    // We completely ignore not found errors and simply log others
    if (kce.getCode() != HttpURLConnection.HTTP_NOT_FOUND) {
        LOG.warn("Error while " + operation, kce);
    }
}
```

If `deleteBlocking()` times out waiting for the old cluster to disappear, the 
exception propagates out of `deploy()`, the reconciler catches it as a 
`ReconciliationException`, sets the status to `UPGRADING` (already persisted 
before `deploy()` is entered), and JOSDK schedules an exponential-backoff 
retry. On retry the operator attempts deletion again. The new cluster is not 
submitted until the old one is confirmed gone.

The one callsite where a timeout should be absorbed rather than propagated is 
`shutdownJobManagersBlocking()` in `NativeFlinkService`, which performs an 
optional scale-to-zero step before the real deletion. That site catches 
`KubernetesClientTimeoutException` locally, logs a warning, and proceeds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to