BillSullivan2020 opened a new issue, #24015:
URL: https://github.com/apache/airflow/issues/24015

   ### Apache Airflow version
   
   2.3.0
   
   ### What happened
   
   Here I provide steps to reproduce this.
   
   Goal: to describe how to reproduce the "Failed to adopt pod" error 
   condition.
   
   The DAG step described below should be of type KubernetesPodOperator.
   
   NOTE: under normal operation, where the MAIN_AIRFLOW_POD is never recycled 
   by k8s, we will never see this edge case. Orphaned workerPods only appear 
   when the MAIN_AIRFLOW_POD is suddenly restarted/stopped while a workerPod 
   is still running.
   
   1] Implement a contrived DAG with a single long-running step (e.g. 6 
   minutes).
   2] Deploy airflow-2.1.4 / airflow-2.3.0 together with the contrived DAG.
   3] Run the contrived DAG.
   4] While the single step is running, check via "kubectl" that the 
   Kubernetes workerPod has been created and is running.
   5] While the workerPod is still running, do "kubectl delete pod 
   <OF_MAIN_AIRFLOW_POD>". The workerPod is now an orphan.
   6] The workerPod continues to run through to completion, after which the 
   k8s status of the pod is Completed; however, the pod does not shut itself 
   down.
   7] Using "kubectl", start up a new <MAIN_AIRFLOW_POD> so the web UI is 
   running again.
   8] From the web UI of the new MAIN_AIRFLOW_POD, run the contrived DAG again.
   9] While the contrived DAG is starting, the logs show "Failed to adopt 
   pod" with a 422 error code.
   
   The error message seen in step 9 appears in two places in the 
   airflow-2.1.4 and airflow-2.3.0 source code.
   The general logging from the MAIN_APP in step 7 may also output the 
   "Failed to adopt pod" error message.
   
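To confirm step 9 without reading the logs by eye, a minimal sketch like the following could filter captured log text for the message. It assumes the logs were saved to a text file (for example from `kubectl logs <OF_MAIN_AIRFLOW_POD>`). The function name `adoption_failures` is hypothetical, and the match is a plain case-insensitive substring test because the exact log line format is not shown in this report.

```python
# Message text as quoted in this report; matched case-insensitively.
ADOPT_ERROR = "failed to adopt pod"


def adoption_failures(log_lines):
    """Return the log lines that mention the adoption failure.

    `log_lines` is any iterable of strings, e.g. an open text file.
    """
    return [
        line.rstrip("\n")
        for line in log_lines
        if ADOPT_ERROR in line.lower()
    ]
```

For example, `adoption_failures(open("main_airflow_pod.log"))` would return the matching lines from a saved log file (the file name here is only an illustration).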
   
   
   ### What you think should happen instead
   
   On previous versions of airflow, e.g. 1.10.x, the orphaned workerPods would 
   be adopted by the restarted airflowMainApp and either used to continue the 
   same DAG or cleared away when complete.
   
   This is not happening with the newer airflow 2.1.4 / 2.3.0 (presumably 
   because the code changed). After the restart, the airflowMainApp seems to 
   try to adopt the workerPod but fails at that point ("Failed to adopt pod" 
   in the logs), and hence it cannot clear away the orphaned pods.
   
   Given this is an edge case only (i.e. we would not expect k8s to be 
   recycling the main airflowApp pod anyway), it does not seem to be an urgent 
   bug. However, the reason I am raising this issue is that any given k8s 
   namespace, in particular in PROD, will slowly fill up with orphaned pods 
   over time (e.g. 1 month?), and somebody would need to log in manually to 
   delete the old pods.
   
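Until adoption works again, the manual cleanup described above could be scripted. The following is a rough sketch, not part of Airflow: the helper names (`list_pods`, `completed_pods`, `delete_commands`) are hypothetical, and it assumes `kubectl` is on the PATH. It relies on the fact that a pod shown as Completed by `kubectl get pods` has `status.phase` set to `Succeeded` in the JSON output. As written it selects every succeeded pod in the namespace, so in practice the listing should be narrowed to Airflow worker pods, for instance with a label selector suited to your deployment.

```python
import json
import subprocess
from datetime import datetime, timedelta, timezone


def list_pods(namespace):
    """Fetch all pods of a namespace as parsed JSON via kubectl."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "--namespace", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def completed_pods(pod_list, older_than, now=None):
    """Return names of pods in phase Succeeded created at least `older_than` ago."""
    now = now or datetime.now(timezone.utc)
    names = []
    for pod in pod_list.get("items", []):
        if pod.get("status", {}).get("phase") != "Succeeded":
            continue  # still running, pending, or failed: leave it alone
        # creationTimestamp looks like "2022-05-30T10:00:00Z"
        created = datetime.fromisoformat(
            pod["metadata"]["creationTimestamp"].replace("Z", "+00:00")
        )
        if now - created >= older_than:
            names.append(pod["metadata"]["name"])
    return names


def delete_commands(names, namespace):
    """Build one `kubectl delete pod` command (as an argument list) per pod."""
    return [
        ["kubectl", "delete", "pod", name, "--namespace", namespace]
        for name in names
    ]
```

A periodic job could call `completed_pods(list_pods(ns), timedelta(days=7))` and pass each resulting command to `subprocess.run`. Printing the commands first, before executing anything, is the safer way to try it out.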
   ### How to reproduce
   
   The steps are identical to those listed under "What happened" above.
   
   
   
   ### Operating System
   
   kubernetes
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   Nothing special.
   
   The CI/CD pipeline builds the app, using requirements.txt to pull in all 
   the required python dependencies (including the dependency on airflow 
   2.1.4 / 2.3.0).
   
   The CI/CD pipeline then packages the app as an ECR image and deploys it 
   directly to the k8s namespace.
   
   ### Anything else
   
   This is 100% reproducible every time; I have tested it multiple times.
   
   I also tested this on the old airflow-1.10.x a couple of times to verify 
   that the bug did not exist previously.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's 
   [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]