mtilda opened a new pull request, #25882:
URL: https://github.com/apache/airflow/pull/25882
## Changes
This PR adds a condition to the method `find_pod` to verify that the name of
the pod matches the name defined by the user. This ensures that two pods with
different names but otherwise identical context do not collide.
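A minimal sketch of the proposed check, assuming the candidate pods have already been matched on labels. The function name and the plain-dict pod representation are illustrative, not Airflow's actual internals:

```python
# Hypothetical sketch: among label-matched pods, keep only those whose
# name starts with the user-defined pod name. This is NOT the actual
# Airflow API, just an illustration of the proposed condition.
def filter_pods_by_name(pods: list[dict], user_defined_name: str) -> list[dict]:
    """Keep only pods whose name begins with the user-defined name."""
    return [pod for pod in pods if pod["name"].startswith(user_defined_name)]


# Two pods with identical labels but different user-defined name prefixes:
candidates = [
    {"name": "prd-example-9ba5a14b73ab41988c1c57afe5ec81f4"},
    {"name": "stg-example-9ba5a14b73ab41988c1c57afe5ec81f4"},
]
# With the name check in place, only the production pod is a match.
matches = filter_pods_by_name(candidates, "prd-example")
```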
## Issue
Two (or more) pods created with `KubernetesPodOperator` from identical
context (`namespace`, `dag_id`, `task_id`, `run_id`, and `map_index`) across
different Airflow environments (e.g. staging + production) collide, causing race
conditions. This happens because the method `find_pod` in
`KubernetesPodOperator` sees these pods as identical.
At first, we thought this was a name collision, so we prepended the string
`stg-` or `prd-` (based on an Airflow Variable) to the pod name. This did not
fix our problem; however, it made the problem easier to debug.
As you can see in the logs below, instead of creating the pod
`prd-example-9ba5a14b73ab41988c1c57afe5ec81f4`, KPO found the "matching" pod
`stg-example-9ba5a14b73ab41988c1c57afe5ec81f4`.
```
...
[2022-08-10, 07:56:49 EDT] {kubernetes_pod.py:221} INFO - Creating pod
prd-example-9ba5a14b73ab41988c1c57afe5ec81f4 with labels: {'dag_id':
'example_k8s_pod_operator', 'task_id': 'run_k8s_pod', 'run_id':
'scheduled__2022-08-09T1145000000-c7f22427c', 'kubernetes_pod_operator':
'True', 'try_number': '4'}
[2022-08-10, 07:56:49 EDT] {kubernetes_pod.py:366} INFO - Found matching pod
stg-example-9ba5a14b73ab41988c1c57afe5ec81f4 with labels {'airflow_version':
'2.3.2-astro.2', 'dag_id': 'example_k8s_pod_operator',
'kubernetes_pod_operator': 'True', 'run_id':
'scheduled__2022-08-09T1145000000-c7f22427c', 'task_id': 'run_k8s_pod',
'try_number': '4'}
[2022-08-10, 07:56:49 EDT] {kubernetes_pod.py:367} INFO - `try_number` of
task_instance: 4
[2022-08-10, 07:56:49 EDT] {kubernetes_pod.py:368} INFO - `try_number` of
pod: 4
...
```
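To illustrate why the collision happens: the labels in the log above are identical across the two environments, so a selector built only from them cannot tell the pods apart. The sketch below mirrors the shape of a Kubernetes label selector; it is an illustration, not Airflow's exact implementation:

```python
# Illustrative only: build a label selector string from pod labels,
# as a stand-in for the label-based matching that find_pod relies on.
def build_label_selector(labels: dict) -> str:
    return ",".join(f"{k}={v}" for k, v in sorted(labels.items()))


# Labels taken from the log output above (identical in both environments):
staging_labels = {
    "dag_id": "example_k8s_pod_operator",
    "task_id": "run_k8s_pod",
    "run_id": "scheduled__2022-08-09T1145000000-c7f22427c",
    "try_number": "4",
}
production_labels = dict(staging_labels)  # same context, different environment

# Both environments yield the same selector, hence the collision:
staging_selector = build_label_selector(staging_labels)
production_selector = build_label_selector(production_labels)
```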
We would like to see the `find_pod` method fail to find a pod in this
situation, because the name does not match what the user defined.
## Test
Run two Airflow projects with identical DAGs (same `dag_id` and schedule
args) with an identical task (same `task_id`) like below:
```py
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    "example_k8s_pod_operator",
    description="Runs scripts in Kubernetes Pods",
    default_args={
        "retries": 1,
        "retry_delay": timedelta(seconds=10),
        "trigger_rule": "all_done",
    },
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,
) as dag:
    KubernetesPodOperator(
        task_id="run_k8s_pod",
        arguments=["Hello world!"],
        cmds=["echo"],
        get_logs=True,
        image="ubuntu",
        # in_cluster=False,  # We use a different cluster, but you should not need to
        is_delete_operator_pod=True,
        kubernetes_conn_id="GKE_CONFIG",
        namespace="default",
        name="example",  # Change this manually or with an Airflow Variable
    )
```
Change the argument `name` so it is unique across your Airflow instances.
Trigger these DAGs at approximately the same time, then observe the logs for
the race-condition behavior described above.
---
## Astronomer Support
This relates to [Astronomer Support ticket
#11647](https://support.astronomer.io/hc/en-us/requests/11647).
## Notes
I am a new contributor, so feel free to educate me on contribution best
practices.