Bowen Li created FLINK-39970:
--------------------------------
Summary: Kubernetes Operator proceeds with cluster resubmission
after Deployment deletion wait timeout, causing AlreadyExists /
object-is-being-deleted race
Key: FLINK-39970
URL: https://issues.apache.org/jira/browse/FLINK-39970
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Affects Versions: 1.14.6
Reporter: Bowen Li
When a Flink job reaches the terminal FAILED state and
`kubernetes.operator.job.restart.failed=true` is set, the operator deletes the
existing cluster and resubmits it.
If foreground deletion of the old JobManager Deployment exceeds
`kubernetes.operator.resource.cleanup.timeout`,
`AbstractFlinkService.deleteBlocking()` catches the non-404
`KubernetesClientException` from `waitUntilCondition()`, logs it, and returns
normally.
The caller then proceeds as if deletion completed:
1. `deleteClusterDeployment()` updates the status to
`JobManagerDeploymentStatus.MISSING`.
2. The failed-job restart flow calls `resubmitJob()`.
3. The operator attempts to create a new Deployment with the same name.
4. Kubernetes rejects it because the old Deployment still exists in Terminating
state:
`AlreadyExists: object is being deleted: deployments.apps "<cluster>" already
exists`
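The sequence above can be reproduced with a small toy model. This is only a
sketch: `FakeApiServer`, `deleteBlockingSwallowing()` and `reconcileOnce()` are
hypothetical names invented for illustration, not operator or Kubernetes client
classes, and the timeout is simulated rather than coming from a real
`waitUntilCondition()` call.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeoutException;

/** Toy model of the race; none of these types are the operator's real classes. */
final class SwallowedTimeoutDemo {

    /** Hypothetical stand-in for the API server: names of Deployments that still exist. */
    static final class FakeApiServer {
        private final Set<String> deployments = new HashSet<>();

        void create(String name) {
            // A Deployment that is still Terminating blocks creation under the same name.
            if (!deployments.add(name)) {
                throw new IllegalStateException(
                        "AlreadyExists: object is being deleted: deployments.apps \""
                                + name + "\" already exists");
            }
        }

        /** Simulated foreground deletion that does not finish within the wait. */
        void deleteAndWait(String name) throws TimeoutException {
            throw new TimeoutException("timed out waiting for " + name + " to be deleted");
        }
    }

    /** Mirrors the reported pattern: the wait failure is logged and the method returns normally. */
    static void deleteBlockingSwallowing(FakeApiServer api, String name) {
        try {
            api.deleteAndWait(name);
        } catch (TimeoutException e) {
            System.out.println("WARN " + e.getMessage());
        }
    }

    /** Returns the error message from the resubmission attempt, or null if it succeeded. */
    static String reconcileOnce(FakeApiServer api, String name) {
        // The caller cannot tell that deletion did not complete.
        deleteBlockingSwallowing(api, name);
        try {
            api.create(name); // resubmission with the same Deployment name
            return null;
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        FakeApiServer api = new FakeApiServer();
        api.create("my-cluster");
        System.out.println(reconcileOnce(api, "my-cluster"));
    }
}
```

Because the delete helper returns normally in both the success and the timeout
case, the caller has no signal to stop on, which is why the create call is
reached while the old object still exists.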
*Expected Behavior*
A non-404 failure or timeout while waiting for Deployment deletion should abort
the current reconciliation.
The operator should not mark the JobManager Deployment as MISSING or attempt to
resubmit until Kubernetes confirms the old Deployment is gone. A later
reconciliation can retry deletion/resubmission.
A 404 should still be treated as a successful deletion.
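One possible shape for this behavior is sketched below. It is not the
operator's code: `ApiException`, `DeletionIncompleteException` and
`tryDelete()` are hypothetical names used only for illustration, and the
delete-and-wait step is passed in as a `Runnable` so the example runs without a
Kubernetes client. A status code of 0 is assumed here to represent a wait
timeout.

```java
/** Sketch of the proposed handling; all names here are hypothetical, not operator APIs. */
final class DeletionWaitSketch {

    /** Hypothetical stand-in for a client exception that carries an HTTP status code. */
    static final class ApiException extends RuntimeException {
        final int code;

        ApiException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Hypothetical signal that the current reconciliation must stop and be retried later. */
    static final class DeletionIncompleteException extends RuntimeException {
        DeletionIncompleteException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** Runs the delete-and-wait step; a 404 counts as success, anything else is rethrown. */
    static void deleteBlocking(String name, Runnable deleteAndWait) {
        try {
            deleteAndWait.run();
        } catch (ApiException e) {
            if (e.code == 404) {
                return; // the Deployment is already gone
            }
            throw new DeletionIncompleteException(
                    "Deployment " + name + " not confirmed deleted", e);
        }
    }

    /** Returns true only when deletion is confirmed, so the caller may mark MISSING and resubmit. */
    static boolean tryDelete(String name, Runnable deleteAndWait) {
        try {
            deleteBlocking(name, deleteAndWait);
            return true;
        } catch (DeletionIncompleteException e) {
            return false; // leave status untouched; a later reconciliation retries
        }
    }
}
```

With this shape, the status update and the resubmission would run only when
`tryDelete()` returns true, and a timeout simply ends the current
reconciliation.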
*Actual Behavior*
Deletion wait timeout is logged and swallowed. The same reconciliation
continues into resubmission while the old Deployment is still being deleted,
causing `AlreadyExists / object is being deleted`.
*Impact*
This causes avoidable restart delays and noisy reconciliation failures. In
worse cases it can leave the FlinkDeployment in an error/recovery loop
requiring manual intervention.