davidharkis commented on issue #47283:
URL: https://github.com/apache/airflow/issues/47283#issuecomment-2726370943
We see a variation of this during the completion of deferred tasks
KubernetesPodOperator, `apache-airflow-providers-cncf-kubernetes` v10.3.1
It only appears to manifest for us running larger numbers of task instances
(e.g couple of hundred), it is almost never more than one instance impacted out
of the run.
For example, most instances will finish like this
`[2025-02-03, 05:06:47 UTC] {pod.py:856} INFO - [base] logs: 05:06:24
Completed successfully
[2025-02-03, 05:06:47 UTC] {pod.py:856} INFO - [base] logs: 05:06:24
[2025-02-03, 05:06:47 UTC] {pod.py:856} INFO - [base] logs: 05:06:24 Done.
PASS=193 WARN=0 ERROR=0 SKIP=0 TOTAL=193
[2025-02-03, 05:06:47 UTC] {pod.py:988} INFO - Deleting pod:
dbt-cluid-stats-v7-experiment-p89rn0vg
[2025-02-03, 05:06:47 UTC] {taskinstance.py:340} ▼ Post task execution logs
[2025-02-03, 05:06:47 UTC] {taskinstance.py:352} INFO - Marking task as
SUCCESS. dag_id=core-identity-cluid-stats-daily,
task_id=dbt-cluid-stats-v7-experiment,
run_id=dataset_triggered__2025-02-03T04:15:52.827850+00:00, map_index=1,
execution_date=20250203T041552, start_date=20250203T045254,
end_date=20250203T050647
[2025-02-03, 05:06:48 UTC] {local_task_job_runner.py:266} INFO - Task exited
with return code 0
[2025-02-03, 05:06:48 UTC] {taskinstance.py:3895} INFO - 0 downstream tasks
scheduled from follow-on schedule check
[2025-02-03, 05:06:48 UTC] {local_task_job_runner.py:245} ▲▲▲ Log group end`
but one will finish with
```
[2025-02-03, 05:15:20 UTC] {pod.py:856} INFO - [base] logs: 05:15:20
Completed successfully
[2025-02-03, 05:15:20 UTC] {pod.py:856} INFO - [base] logs: 05:15:20
[2025-02-03, 05:15:20 UTC] {pod.py:856} INFO - [base] logs: 05:15:20 Done.
PASS=193 WARN=0 ERROR=0 SKIP=0 TOTAL=193
[2025-02-03, 05:15:22 UTC] {pod_manager.py:622} INFO - Pod
dbt-cluid-stats-v7-experiment-nl1alqg7 has phase Running
[2025-02-03, 05:15:24 UTC] {pod.py:988} INFO - Deleting pod:
dbt-cluid-stats-v7-experiment-nl1alqg7
[2025-02-03, 05:15:24 UTC] {taskinstance.py:3311} ERROR - Task failed with
exception
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
line 767, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
line 733, in _execute_callable
return ExecutionCallableRunner(
^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py",
line 252, in run
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py",
line 1804, in resume_execution
return execute_callable(context)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/code/id5/kubernetes/operator/pod.py", line 86, in
trigger_reentry
raise AirflowException(message)
airflow.exceptions.AirflowException: Traceback (most recent call last):
File "/opt/code/id5/kubernetes/trigger/pod.py", line 114, in run
event = await self._wait_for_container_completion()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
line 246, in _wait_for_container_completion
pod = await self.hook.get_pod(self.pod_name, self.pod_namespace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
line 756, in get_pod
pod: V1Pod = await v1_api.read_namespaced_pod(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
line 192, in __call_api
raise e
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
line 185, in __call_api
response_data = await self.request(
^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
line 195, in GET
return (await self.request("GET", url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
line 189, in request
raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: <CIMultiDictProxy('Audit-Id':
'00b7a901-68cc-407c-a1c7-1dddba7739bb', 'Cache-Control': 'no-cache, private',
'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid':
'83f2afc0-22a9-4eef-adaa-d9af4d5febf6', 'X-Kubernetes-Pf-Prioritylevel-Uid':
'd0830089-6ed9-4917-90ac-dc395de94ce8', 'Date': 'Mon, 03 Feb 2025 05:07:37
GMT', 'Content-Length': '244')>
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
\"dbt-cluid-stats-v7-experiment-nl1alqg7\" not
found","reason":"NotFound","details":{"name":"dbt-cluid-stats-v7-experiment-nl1alqg7","kind":"pods"},"code":404}
[2025-02-03, 05:15:24 UTC] {taskinstance.py:1225} INFO - Marking task as
UP_FOR_RETRY. dag_id=core-identity-cluid-stats-daily,
task_id=dbt-cluid-stats-v7-experiment,
run_id=dataset_triggered__2025-02-03T04:15:52.827850+00:00, map_index=115,
execution_date=20250203T041552, start_date=20250203T050730,
end_date=20250203T051524
[2025-02-03, 05:15:24 UTC] {taskinstance.py:340} ▶ Post task execution logs
```
The pod is being deleted while the triggerer is still checking it. We've
tried any number of things to stop this, numbers of triggerers, scaling,
dropping pgbouncer, even overloading `wait_for_container_completion` to handle
it, which is possible, but then results in a follow on failure with
`_wait_for_pod_start`
`[2025-03-15, 04:14:01 UTC] {pod.py:865} INFO - [base] logs: 04:14:01
Completed successfully
[2025-03-15, 04:14:01 UTC] {pod.py:865} INFO - [base] logs: 04:14:01
[2025-03-15, 04:14:01 UTC] {pod.py:865} INFO - [base] logs: 04:14:01 Done.
PASS=39 WARN=0 ERROR=0 SKIP=0 TOTAL=39
[2025-03-15, 04:14:04 UTC] {pod_manager.py:654} INFO - Pod
dbt-cluid-stats-v6-public-in22mzyx has phase Running
[2025-03-15, 04:14:06 UTC] {pod.py:997} INFO - Deleting pod:
dbt-cluid-stats-v6-public-in22mzyx
[2025-03-15, 04:14:06 UTC] {taskinstance.py:3313} ERROR - Task failed with
exception
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
line 768, in _execute_task
result = _execute_callable(context=context, **execute_callable_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
line 734, in _execute_callable
return ExecutionCallableRunner(
^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py",
line 252, in run
return self.func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py",
line 1816, in resume_execution
return execute_callable(context)
^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py",
line 790, in trigger_reentry
raise AirflowException(message)
airflow.exceptions.AirflowException: Traceback (most recent call last):
File "/opt/code/id5/kubernetes/trigger/pod.py", line 198, in run
state = await self._wait_for_pod_start()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/code/id5/kubernetes/trigger/pod.py", line 111, in
_wait_for_pod_start
pod = await self.hook.get_pod(self.pod_name, self.pod_namespace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
line 821, in get_pod
pod: V1Pod = await v1_api.read_namespaced_pod(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
line 192, in __call_api
raise e
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
line 185, in __call_api
response_data = await self.request(
^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
line 195, in GET
return (await self.request("GET", url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
line 189, in request
raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: <CIMultiDictProxy('Audit-Id':
'efffc385-920d-401d-b74a-0612f652daa2', 'Cache-Control': 'no-cache, private',
'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid':
'83f2afc0-22a9-4eef-adaa-d9af4d5febf6', 'X-Kubernetes-Pf-Prioritylevel-Uid':
'd0830089-6ed9-4917-90ac-dc395de94ce8', 'Date': 'Sat, 15 Mar 2025 04:11:06
GMT', 'Content-Length': '236')>
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
\"dbt-cluid-stats-v6-public-in22mzyx\" not
found","reason":"NotFound","details":{"name":"dbt-cluid-stats-v6-public-in22mzyx","kind":"pods"},"code":404}
[2025-03-15, 04:14:06 UTC] {taskinstance.py:1226} INFO - Marking task as
UP_FOR_RETRY. dag_id=core-identity-cluid-stats-daily,
task_id=dbt-cluid-stats-v6-public,
run_id=dataset_triggered__2025-03-15T04:09:28.081804+00:00, map_index=42,
execution_date=20250315T040928, start_date=20250315T041037,
end_date=20250315T041406
[2025-03-15, 04:14:06 UTC] {taskinstance.py:341} ▶ Post task execution logs`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]