Bowen Li created FLINK-39970:
--------------------------------

             Summary: Kubernetes Operator proceeds with cluster resubmission 
after Deployment deletion wait timeout, causing AlreadyExists / 
object-is-being-deleted race
                 Key: FLINK-39970
                 URL: https://issues.apache.org/jira/browse/FLINK-39970
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
    Affects Versions: 1.14.6
            Reporter: Bowen Li


When a Flink job reaches the terminal FAILED state and 
`kubernetes.operator.job.restart.failed=true` is set, the operator deletes the 
existing cluster and resubmits it.

If foreground deletion of the old JobManager Deployment exceeds 
`kubernetes.operator.resource.cleanup.timeout`, 
`AbstractFlinkService.deleteBlocking()` catches the non-404 
`KubernetesClientException` from `waitUntilCondition()`, logs it, and returns 
normally.

The caller then proceeds as if deletion completed:

1. `deleteClusterDeployment()` updates status to 
`JobManagerDeploymentStatus.MISSING`.
2. The failed-job restart flow calls `resubmitJob()`.
3. The operator attempts to create a new Deployment with the same name.
4. Kubernetes rejects it because the old Deployment still exists in Terminating 
state:

`AlreadyExists: object is being deleted: deployments.apps "<cluster>" already 
exists`
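
The sequence above can be reproduced in isolation with a toy model. The sketch 
below is not operator code: `ApiException`, `FakeCluster` and 
`SwallowingReconciler` are hypothetical stand-ins, and the error code `0` used 
for the timeout is an assumption made only for illustration. It shows how a 
swallowed wait failure leads straight into the create conflict.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a Kubernetes client error; not the fabric8 class. */
class ApiException extends RuntimeException {
    final int code;

    ApiException(int code, String message) {
        super(message);
        this.code = code;
    }
}

/** Toy in-memory cluster: a deleted Deployment stays Terminating until finalized. */
class FakeCluster {
    private final Set<String> live = new HashSet<>();
    private final Set<String> terminating = new HashSet<>();

    void create(String name) {
        if (terminating.contains(name)) {
            throw new ApiException(409, "AlreadyExists: object is being deleted: "
                    + "deployments.apps \"" + name + "\" already exists");
        }
        if (!live.add(name)) {
            throw new ApiException(409, "AlreadyExists: deployments.apps \"" + name + "\"");
        }
    }

    /** Starts deletion; the object lingers in Terminating state. */
    void delete(String name) {
        if (live.remove(name)) {
            terminating.add(name);
        }
    }

    /** Simulates Kubernetes finishing foreground deletion. */
    void finalizeDeletion(String name) {
        terminating.remove(name);
    }

    /** Simulates a bounded wait that times out while the object still exists. */
    void waitUntilGone(String name) {
        if (live.contains(name) || terminating.contains(name)) {
            // Code 0 is an assumption of this sketch, meaning "no HTTP status".
            throw new ApiException(0, "timed out waiting for deletion of " + name);
        }
    }
}

/** Mirrors the reported flow: the wait failure is logged and swallowed. */
class SwallowingReconciler {
    static String reconcile(FakeCluster cluster, String name) {
        cluster.delete(name);
        try {
            cluster.waitUntilGone(name);
        } catch (ApiException e) {
            // Problematic step: log and carry on as if deletion had completed.
            System.out.println("Ignoring deletion wait failure: " + e.getMessage());
        }
        try {
            cluster.create(name); // resubmission under the same name
            return "RESUBMITTED";
        } catch (ApiException e) {
            return e.getMessage();
        }
    }
}
```

In this model, resubmission succeeds only once `finalizeDeletion()` has run, 
which corresponds to Kubernetes actually removing the old Deployment.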

*Expected Behavior*

A non-404 failure or timeout while waiting for Deployment deletion should abort 
the current reconciliation.

The operator should not mark the JobManager Deployment as MISSING or attempt to 
resubmit until Kubernetes confirms the old Deployment is gone. A later 
reconciliation can retry deletion/resubmission.

404 should still be treated as successful deletion.
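
A minimal sketch of the expected handling, under stated assumptions: 
`ClientError` and `StrictDeletion` are hypothetical names, the wait is passed 
in as a `Runnable` purely to keep the example self-contained, and returning 
`false` stands in for whatever mechanism the operator would use to abort the 
reconciliation. This is one possible shape of a fix, not the actual patch.

```java
/** Hypothetical stand-in for a Kubernetes client error; not the fabric8 class. */
class ClientError extends RuntimeException {
    final int code;

    ClientError(int code, String message) {
        super(message);
        this.code = code;
    }
}

class StrictDeletion {
    enum Status { READY, MISSING }

    Status status = Status.READY;
    boolean resubmitted = false;

    /** Waits for deletion; 404 counts as "already gone", anything else propagates. */
    static void deleteBlocking(Runnable waitForDeletion) {
        try {
            waitForDeletion.run();
        } catch (ClientError e) {
            if (e.code != 404) {
                throw e; // timeout or other failure: do not pretend deletion finished
            }
        }
    }

    /** Returns true if this pass resubmitted, false if it aborted for a later retry. */
    boolean reconcile(Runnable waitForDeletion) {
        try {
            deleteBlocking(waitForDeletion);
        } catch (ClientError e) {
            // Abort: status is left untouched and no resubmission is attempted.
            return false;
        }
        status = Status.MISSING; // only after deletion is confirmed
        resubmitted = true;      // placeholder for the resubmission step
        return true;
    }
}
```

With this shape, a timed-out wait leaves the status unchanged so that a later 
reconciliation can retry, while a 404 or a clean wait proceeds to resubmission.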

*Actual Behavior*

Deletion wait timeout is logged and swallowed. The same reconciliation 
continues into resubmission while the old Deployment is still being deleted, 
causing `AlreadyExists / object is being deleted`.

*Impact*

This causes avoidable restart delays and noisy reconciliation failures. In the 
worst case, it can leave the FlinkDeployment in an error/recovery loop that 
requires manual intervention.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
