BillSullivan2020 commented on issue #24015:
URL: https://github.com/apache/airflow/issues/24015#issuecomment-1210792269

   Hello.
   I put this issue aside and stopped working on it; now I have allocated some time again.
   I have some feedback and a question.
   Duplicate metadata was a plausible cause; however, I now have the simplest DAG possible to reproduce the problem. I will show you:
   
   - I upgraded to Airflow 2.3.3.

   - I wrote the simplest possible DAG (it only runs a shell command with a delay).

   - Same approach as before: restart the main Airflow pod while the worker pod is running, and the error message appears; the worker pod seems to be orphaned.
   
   Simplified DAG:

```python
airflow_adopt_pod_test_dag = DAG(
    dag_id='airflow_adopt_pod_test_dag',
    default_args=args,
    catchup=False,
    schedule_interval='5 4 * * *'
)
```
   
```python
task = KubernetesPodOperator(
    task_id="write-xcom",
    name="write-xcom",
    namespace=OUR_CORRECT_NAME,
    service_account_name=OUR_CORRECT_NAME,
    image=OUR_CORRECT_INTERNAL_ECR_REPO_WITH_ALPINE_IMAGE,
    cmds=["sh",
          "-c",
          "mkdir -p /airflow/xcom/;sleep 600;echo '[1,2,3,4]' > /airflow/xcom/return.json"],
    do_xcom_push=True,
    is_delete_operator_pod=True,
    in_cluster=True,
    get_logs=True,
    dag=airflow_adopt_pod_test_dag,
)
```
   
   The `sleep 600` means the pod sticks around for ten minutes before it becomes Completed.
   
```
[2022-08-10 11:25:53,796] {kubernetes_executor.py:729} INFO - attempting to adopt pod airflowadoptpodtestdagwritexco-a27600cf4cf84f5b92a985f7e086057d
[2022-08-10 11:25:53,810] {kubernetes_executor.py:745} INFO - Failed to adopt pod airflowadoptpodtestdagwritexco-a27600cf4cf84f5b92a985f7e086057d. Reason: (422)
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Pod \"airflowadoptpodtestdagwritexco-a27600cf4cf84f5b92a985f7e086057d\" is invalid: spec: Forbidden: pod updates may not change fields other than
```
   
   - The Forbidden error message comes from Kubernetes. If you search online, you can see that once a Pod is created, most of its fields are read-only; only a handful of spec fields, plus most of the metadata (labels, annotations), are modifiable.
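   To make that rule concrete, here is a small sketch (my own simplification, not Kubernetes source code; the field list is abridged, and container-image updates, which are also allowed, are omitted) of the kind of check the API server applies:

```python
# Simplified illustration of the Kubernetes rule that pod updates may not
# change most spec fields after creation. Field names follow the Kubernetes
# API; the helper itself is hypothetical.
MUTABLE_SPEC_FIELDS = {"activeDeadlineSeconds", "tolerations"}

def patch_is_allowed(patch_body: dict) -> bool:
    """True if the patch only touches fields Kubernetes lets you change."""
    spec = patch_body.get("spec", {})
    return all(field in MUTABLE_SPEC_FIELDS for field in spec)

# A metadata-only patch is fine; a patch that sends a different value for a
# read-only spec field such as nodeName is rejected with a 422.
print(patch_is_allowed({"metadata": {"labels": {"airflow-worker": "42"}}}))  # True
print(patch_is_allowed({"spec": {"nodeName": "node-1"}}))                    # False
```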
   
   - The Airflow source code uses the kubernetes-client library underneath.

   - The Airflow source code at the "Failed to adopt pod" part is:
   
```python
kube_client.patch_namespaced_pod(
    name=pod.metadata.name,
    namespace=pod.metadata.namespace,
    body=PodGenerator.serialize_pod(pod),
)
```
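   Since pod labels are mutable, adoption should only need a metadata-level patch. A minimal sketch of such a patch body (the helper name is my own invention, purely for illustration; the label key mirrors the `airflow-worker` label the executor uses to track its pods):

```python
# Hypothetical helper: build a patch body that only touches metadata.labels,
# leaving the immutable spec alone.
def build_adoption_patch(scheduler_job_id: str) -> dict:
    return {"metadata": {"labels": {"airflow-worker": scheduler_job_id}}}

patch_body = build_adoption_patch("42")
# A body like this could be sent instead of the fully serialized pod:
#   kube_client.patch_namespaced_pod(name=..., namespace=..., body=patch_body)
print(patch_body)  # {'metadata': {'labels': {'airflow-worker': '42'}}}
```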
   
   - Now, this function is provided by the kubernetes-client library, where we see that:

```python
return self.api_client.call_api(
    '/api/v1/namespaces/{namespace}/pods/{name}', 'PATCH',
    path_params,
    query_params,
    header_params,
    body=body_params,
    post_params=form_params,
    files=local_var_files,
    response_type='V1Pod',  # noqa: E501
    auth_settings=auth_settings,
    async_req=local_var_params.get('async_req'),
    _return_http_data_only=local_var_params.get('_return_http_data_only'),  # noqa: E501
    _preload_content=local_var_params.get('_preload_content', True),
    _request_timeout=local_var_params.get('_request_timeout'),
    collection_formats=collection_formats)
```
   We can see that it does a PATCH, i.e. it asks Kubernetes to modify/merge/update that particular pod with the full serialized body. And, as I explained, most of those fields are read-only in Kubernetes. Hence, the error condition I outlined above.
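   To make the merge behaviour concrete, here is a toy recursive merge (my own simplification; it ignores null-deletion and strategic-merge list handling) showing why a metadata-only patch leaves `spec` alone, while re-sending a full serialized pod whose spec has drifted from the live object registers as a forbidden spec change:

```python
def merge(live: dict, patch: dict) -> dict:
    """Toy recursive merge of a patch body into the live object."""
    out = dict(live)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)
        else:
            out[key] = value
    return out

live = {"metadata": {"labels": {"run": "a"}}, "spec": {"nodeName": "node-1"}}

# Label-only patch: spec is untouched after the merge.
relabel = {"metadata": {"labels": {"airflow-worker": "42"}}}
print(merge(live, relabel)["spec"] == live["spec"])  # True

# Full-body patch built from a re-serialized copy whose spec drifted:
# the merged spec now differs, which is what the API server forbids.
full = {"metadata": {"labels": {"airflow-worker": "42"}}, "spec": {"nodeName": None}}
print(merge(live, full)["spec"] == live["spec"])  # False
```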
   
   So my question is: do we have any way to solve this? Also, why are we patching the whole pod in order to adopt it?
   To me, it seems like a bug in the kubernetes-client library.
   The side effect of this is that we could end up with a growing number of orphaned pods in the Kubernetes namespace.
   
   I'll monitor the orphaned pods over the coming few days.

