[ 
https://issues.apache.org/jira/browse/FLINK-39970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-39970:
-----------------------------------
    Labels: pull-request-available  (was: )

> Kubernetes Operator proceeds with cluster resubmission after Deployment 
> deletion wait timeout, causing AlreadyExists / object-is-being-deleted race
> ---------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-39970
>                 URL: https://issues.apache.org/jira/browse/FLINK-39970
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: 1.14.6
>            Reporter: Bowen Li
>            Priority: Major
>              Labels: pull-request-available
>
> When a Flink job reaches the terminal FAILED state and 
> `kubernetes.operator.job.restart.failed` is set to `true`, the operator 
> deletes the existing cluster and resubmits it.
> If foreground deletion of the old JobManager Deployment exceeds 
> `kubernetes.operator.resource.cleanup.timeout`, 
> `AbstractFlinkService.deleteBlocking()` catches the non-404 
> `KubernetesClientException` from `waitUntilCondition()`, logs it, and returns 
> normally.
> The caller then proceeds as if deletion completed:
> 1. `deleteClusterDeployment()` updates status to 
> JobManagerDeploymentStatus.MISSING.
> 2. The failed-job restart flow calls `resubmitJob()`.
> 3. The operator attempts to create a new Deployment with the same name.
> 4. Kubernetes rejects it because the old Deployment still exists in 
> Terminating state:
> `AlreadyExists: object is being deleted: deployments.apps "<cluster>" already 
> exists`
> *Expected Behavior*
> A non-404 failure or timeout while waiting for Deployment deletion should 
> abort the current reconciliation.
> The operator should not mark the JobManager Deployment as MISSING or attempt 
> to resubmit until Kubernetes confirms the old Deployment is gone. A later 
> reconciliation can retry deletion/resubmission.
> A 404 should still be treated as a successful deletion.
> *Actual Behavior*
> The deletion wait timeout is logged and swallowed. The same reconciliation 
> then continues into resubmission while the old Deployment is still being 
> deleted, causing `AlreadyExists / object is being deleted`.
> *Impact*
> This causes avoidable restart delays and noisy reconciliation failures. In 
> worse cases it can leave the FlinkDeployment in an error/recovery loop 
> requiring manual intervention.
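
The expected behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the operator's actual code: `ApiException` is a hypothetical stand-in for the `KubernetesClientException` mentioned in the report, `Status` only mimics the idea of `JobManagerDeploymentStatus`, and `reconcile()` is an invented helper. The real `AbstractFlinkService.deleteBlocking()` signature and the actual fix in the pull request may differ.

```java
public class DeleteBlockingSketch {

    /** Hypothetical stand-in for KubernetesClientException; only the HTTP code matters here. */
    static class ApiException extends RuntimeException {
        final int code;

        ApiException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Hypothetical status enum, loosely modeled on JobManagerDeploymentStatus. */
    enum Status { DEPLOYED, MISSING }

    /**
     * Runs the delete-and-wait operation. A 404 means the object is already
     * gone and counts as success. Any other failure, including a wait
     * timeout, is rethrown instead of being logged and swallowed.
     */
    static void deleteBlocking(String name, Runnable deleteAndWait) {
        try {
            deleteAndWait.run();
        } catch (ApiException e) {
            if (e.code == 404) {
                return; // already deleted, treat as success
            }
            throw new IllegalStateException("Deletion of " + name + " not confirmed", e);
        }
    }

    /**
     * Hypothetical caller: the status becomes MISSING (and resubmission
     * becomes safe) only when deletion was confirmed. Otherwise the
     * previous status is kept so a later reconciliation can retry.
     */
    static Status reconcile(String name, Runnable deleteAndWait, Status current) {
        try {
            deleteBlocking(name, deleteAndWait);
        } catch (IllegalStateException e) {
            return current; // abort this round, do not resubmit
        }
        return Status.MISSING;
    }

    public static void main(String[] args) {
        // The code 0 below is an arbitrary non-404 value chosen for illustration.
        Runnable timeout = () -> { throw new ApiException(0, "timed out waiting for deletion"); };
        Runnable notFound = () -> { throw new ApiException(404, "not found"); };
        Runnable deleted = () -> { };

        System.out.println(reconcile("demo", timeout, Status.DEPLOYED));  // prints DEPLOYED
        System.out.println(reconcile("demo", notFound, Status.DEPLOYED)); // prints MISSING
        System.out.println(reconcile("demo", deleted, Status.DEPLOYED));  // prints MISSING
    }
}
```

The point of the sketch is that the decision to mark the Deployment as missing sits behind a normal return from `deleteBlocking()`, so a timeout can no longer fall through into resubmission within the same reconciliation.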



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
