BillSullivan2020 opened a new issue, #24015:
URL: https://github.com/apache/airflow/issues/24015
### Apache Airflow version
2.3.0
### What happened
Here i provide steps to reproduce this.
Goal of this: to describe how to reproduce the "Failed to Adopt pod" error
condition.
The DAG->step Described Below should be of type KubernetesPodOperator
NOTE: under normal operation,
(where the MAIN_AIRFLOW_POD is never recycled by k8s, we will never see this
edge-case)
(it is only when the workerPod is still running, but the MAIN_AIRFLOW_POD is
suddenly restarted/stopped)
(that we would see orphan->workerPods)
1] Implement a contrived-DAG, with a single step -> which is long-running
(e.g. 6 minutes)
2] Deploy your airflow-2.1.4 / airfow-2.3.0 together with the contrived-DAG
3] Run your contrived-DAG.
4] in the middle of running the single-step, check via "kubectl" that your
Kubernetes->workerPod has been created / running
5] while workerPod still running, do "kubectl delete pod
<OF_MAIN_AIRFLOW_POD>". This will mean that the workerPod becomes an orphan.
6] the workerPod still continues to run through to completion. after which
the K8S->status of the pod will be Completed, however the pod doesn't shut down
itself.
7] "kubectl" start up a new <MAIN_AIRFLOW_POD> so the web-ui is running
again.
8] MAIN_AIRFLOW_POD->webUi - Run your contrived-DAG again
9] while the contrived-DAG is starting/tryingToStart etc, you will see in
the logs printed out "Failed to adopt pod" -> with 422 error code.
The step-9 with the error message, you will find two appearances of this
error msg in the airflow-2.1.4, airflow-2.3.0 source-code.
The step-7 may also - general logging from the MAIN_APP - may also output
the "Failed to adopt pod" error message also.
### What you think should happen instead
On previous versions of airflow e.g. 1.10.x, the orphan-workerPods would be
adopted by the 2nd run-time of the airflowMainApp and either used to continue
the same DAG and/or cleared away when complete.
This is not happening with the newer airflow 2.1.4 / 2.3.0 (presumably
because the code changed), and upon the 2nd run-time of the airflowMainApp - it
would seem to try to adopt-workerPod but fails at that point ("Failed to adopt
pod" in the logs and hence it cannot clear away orphan pods).
Given this is an edge-case only, (i.e. we would not expect k8s to be
recycling the main airflowApp/pod anyway), it doesn't seem totally urgent bug.
However, the only reason for me raising this issue with yourselves is that
given any k8s->namespace, in particular in PROD, over time (e.g. 1 month?)
the namespace will slowly be being filled up with orphanPods and somebody would
need to manually log-in to delete old pods.
### How to reproduce
Here i provide steps to reproduce this.
Goal of this: to describe how to reproduce the "Failed to Adopt pod" error
condition.
The DAG->step Described Below should be of type KubernetesPodOperator
NOTE: under normal operation,
(where the MAIN_AIRFLOW_POD is never recycled by k8s, we will never see this
edge-case)
(it is only when the workerPod is still running, but the MAIN_AIRFLOW_POD is
suddenly restarted/stopped)
(that we would see orphan->workerPods)
1] Implement a contrived-DAG, with a single step -> which is long-running
(e.g. 6 minutes)
2] Deploy your airflow-2.1.4 / airfow-2.3.0 together with the contrived-DAG
3] Run your contrived-DAG.
4] in the middle of running the single-step, check via "kubectl" that your
Kubernetes->workerPod has been created / running
5] while workerPod still running, do "kubectl delete pod
<OF_MAIN_AIRFLOW_POD>". This will mean that the workerPod becomes an orphan.
6] the workerPod still continues to run through to completion. after which
the K8S->status of the pod will be Completed, however the pod doesn't shut down
itself.
7] "kubectl" start up a new <MAIN_AIRFLOW_POD> so the web-ui is running
again.
8] MAIN_AIRFLOW_POD->webUi - Run your contrived-DAG again
9] while the contrived-DAG is starting/tryingToStart etc, you will see in
the logs printed out "Failed to adopt pod" -> with 422 error code.
The step-9 with the error message, you will find two appearances of this
error msg in the airflow-2.1.4, airflow-2.3.0 source-code.
The step-7 may also - general logging from the MAIN_APP - may also output
the "Failed to adopt pod" error message also.
### Operating System
kubernetes
### Versions of Apache Airflow Providers
_No response_
### Deployment
Other 3rd-party Helm chart
### Deployment details
nothing special.
it (CI/CD pipeline) builds the app. using requirements.txt to pull-in all
the required python dependencies (including there is a dependency for the
airflow-2.1.4 / 2.3.0)
it (CI/CD pipeline) packages the app as an ECR image & then deploy directly
to k8s namespace.
### Anything else
this is 100% reproducible each & every time.
i have tested this multiple times.
also - i tested this on the old airflow-1.10.x a couple of times to verify
that the bug did not exist previously
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]