[ https://issues.apache.org/jira/browse/FLINK-32334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-32334:
-----------------------------------
    Labels: pull-request-available  (was: )

> Operator failed to create taskmanager deployment because it already exist
> -------------------------------------------------------------------------
>
>                 Key: FLINK-32334
>                 URL: https://issues.apache.org/jira/browse/FLINK-32334
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.5.0
>            Reporter: Nicolas Fraison
>            Assignee: Nicolas Fraison
>            Priority: Major
>              Labels: pull-request-available
>
> During a job upgrade, the operator failed to start the new job because it
> could not create the taskmanager deployment:
>
> {code:java}
> Jun 12 19:45:28.115 >>> Status | Error | UPGRADING |
> {"type":"org.apache.flink.kubernetes.operator.exception.ReconciliationException","message":"org.apache.flink.client.deployment.ClusterDeploymentException:
> Could not create Kubernetes cluster
> \"flink-metering\".","throwableList":[{"type":"org.apache.flink.client.deployment.ClusterDeploymentException","message":"Could
> not create Kubernetes cluster
> \"flink-metering\"."},{"type":"org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException","message":"Failure
> executing: POST at:
> https://10.129.144.1/apis/apps/v1/namespaces/metering/deployments. Message:
> object is being deleted: deployments.apps \"flink-metering-taskmanager\"
> already exists. Received status: Status(apiVersion=v1, code=409,
> details=StatusDetails(causes=[], group=apps, kind=deployments,
> name=flink-metering-taskmanager, retryAfterSeconds=null, uid=null,
> additionalProperties={}), kind=Status, message=object is being deleted:
> deployments.apps \"flink-metering-taskmanager\" already exists,
> metadata=ListMeta(_continue=null, remainingItemCount=null,
> resourceVersion=null, selfLink=null, additionalProperties={}),
> reason=AlreadyExists, status=Failure, additionalProperties={})."}]}
> {code}
> As the error log indicates, this happens because the taskmanager deployment
> still exists while it is being deleted.
> Looking at the [source
> code|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/StandaloneFlinkService.java#L150],
> the operator does rely on the FOREGROUND deletion policy by default.
> Still, it seems that the REST API delete call only waits until the resource
> has been modified and the {{deletionTimestamp}} has been added to the
> metadata: [ensure delete returns only when the delete operation is fully
> finished - Issue #3246 -
> fabric8io/kubernetes-client|https://github.com/fabric8io/kubernetes-client/issues/3246#issuecomment-874019899]
> So we can hit this issue if the k8s cluster is slow to actually delete the
> deployment.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
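
One way to avoid the race described in the issue is to poll until the old deployment is really gone before creating the new one. The following is only a minimal sketch of that idea under stated assumptions, not the fix applied in the operator: `DeletionWaiter`, `waitUntilGone`, and the `stillExists` check are hypothetical names introduced here, and a real caller would back `stillExists` with a lookup of the taskmanager deployment through the Kubernetes client.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: blocks until a resource lookup reports the resource as gone. */
final class DeletionWaiter {

    private DeletionWaiter() {}

    /**
     * Polls stillExists until it returns false.
     *
     * @param stillExists  assumed to return true while the resource (for example the
     *                     taskmanager deployment) is still visible in the API server
     * @param timeout      maximum total time to wait
     * @param pollInterval pause between two checks
     * @throws IllegalStateException if the resource still exists after the timeout,
     *                               or if the waiting thread is interrupted
     */
    static void waitUntilGone(BooleanSupplier stillExists, Duration timeout, Duration pollInterval) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (stillExists.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new IllegalStateException("Resource still exists after " + timeout);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag so callers can react to it.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for deletion", e);
            }
        }
    }
}
```

With a helper like this, the delete call would be followed by `waitUntilGone(...)` and the new deployment would only be created once the check reports the old one as absent. A bounded timeout keeps the reconciliation loop from blocking indefinitely when the cluster is slow.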