[ 
https://issues.apache.org/jira/browse/FLINK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated FLINK-32334:
-----------------------------------
    Labels: pull-request-available  (was: )

> Operator failed to create taskmanager deployment because it already exist
> -------------------------------------------------------------------------
>
>                 Key: FLINK-32334
>                 URL: https://issues.apache.org/jira/browse/FLINK-32334
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.5.0
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Major
>              Labels: pull-request-available
>
> During a job upgrade, the operator failed to start the new job because it 
> could not create the taskmanager deployment:
>  
> {code:java}
> Jun 12 19:45:28.115 >>> Status | Error | UPGRADING | 
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
>  Could not create Kubernetes cluster 
> \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
>  not create Kubernetes cluster 
> \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
>  executing: POST at: 
> https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message: 
> object is being deleted: deployments.apps \"flink-metering-taskmanager\" 
> already exists. Received status: Status(apiVersion=v1, code=409, 
> details=StatusDetails(causes=[], group=apps, kind=deployments, 
> name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null, 
> additionalProperties={}), kind=Status, message=object is being deleted: 
> deployments.apps \"flink-metering-taskmanager\" already exists, 
> metadata=ListMeta(_continue=null, remainingItemCount=null, 
> resourceVersion=null, selfLink=null, additionalProperties={}), 
> reason=AlreadyExists, status=Failure, additionalProperties={})."}]} {code}
> As the error log indicates, this happens because the taskmanager deployment 
> still exists while it is being deleted.
> Looking at the [source 
> code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150]
>  we do rely on the FOREGROUND deletion policy by default.
> Still, it seems that the REST API call to delete only waits until the resource 
> has been modified and the {{deletionTimestamp}} has been added to the 
> metadata: [ensure delete returns only when the delete operation is fully 
> finished - Issue #3246 - 
> fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899]
> So we could hit this issue whenever the k8s cluster is slow to actually 
> delete the deployment.
>  
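One possible way to avoid the race described above is to poll until the deployment is really gone before creating the new one. The sketch below is a minimal, library-independent illustration of that idea, not the operator's actual fix: the class DeletionWaiter and the method waitUntilGone are hypothetical names, and the existence check is passed in as a plain BooleanSupplier so the example stays self-contained.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper; not part of the Flink operator or of fabric8. */
public final class DeletionWaiter {

    private DeletionWaiter() {}

    /**
     * Polls until the resource no longer exists or the timeout elapses.
     *
     * @param stillExists returns true while the resource is still visible
     * @param timeout maximum total time to wait
     * @param pollInterval pause between two existence checks
     * @return true if the resource disappeared in time, false on timeout
     *     or if the waiting thread was interrupted
     */
    public static boolean waitUntilGone(
            BooleanSupplier stillExists, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (stillExists.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // still present after the timeout
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

In the operator, the supplier could plausibly be a check such as `client.apps().deployments().inNamespace(ns).withName(name).get() != null` on the fabric8 client, with `ns` and `name` standing in for the real namespace and deployment name. Only when waitUntilGone returns true would the new taskmanager deployment be created; on false, the reconciliation could be retried later.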



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
