davidharkis commented on issue #47283:
URL: https://github.com/apache/airflow/issues/47283#issuecomment-2726370943

   We see a variation of this during the completion of deferred tasks 
KubernetesPodOperator, `apache-airflow-providers-cncf-kubernetes` v10.3.1
   
   It only appears to manifest for us running larger numbers of task instances 
(e.g couple of hundred), it is almost never more than one instance impacted out 
of the run.
   
   For example, most instances will finish like this
   
   `[2025-02-03, 05:06:47 UTC] {pod.py:856} INFO - [base] logs: 05:06:24  
Completed successfully
   [2025-02-03, 05:06:47 UTC] {pod.py:856} INFO - [base] logs: 05:06:24  
   [2025-02-03, 05:06:47 UTC] {pod.py:856} INFO - [base] logs: 05:06:24  Done. 
PASS=193 WARN=0 ERROR=0 SKIP=0 TOTAL=193
   [2025-02-03, 05:06:47 UTC] {pod.py:988} INFO - Deleting pod: 
dbt-cluid-stats-v7-experiment-p89rn0vg
   [2025-02-03, 05:06:47 UTC] {taskinstance.py:340} ▼ Post task execution logs
   [2025-02-03, 05:06:47 UTC] {taskinstance.py:352} INFO - Marking task as 
SUCCESS. dag_id=core-identity-cluid-stats-daily, 
task_id=dbt-cluid-stats-v7-experiment, 
run_id=dataset_triggered__2025-02-03T04:15:52.827850+00:00, map_index=1, 
execution_date=20250203T041552, start_date=20250203T045254, 
end_date=20250203T050647
   [2025-02-03, 05:06:48 UTC] {local_task_job_runner.py:266} INFO - Task exited 
with return code 0
   [2025-02-03, 05:06:48 UTC] {taskinstance.py:3895} INFO - 0 downstream tasks 
scheduled from follow-on schedule check
   [2025-02-03, 05:06:48 UTC] {local_task_job_runner.py:245} ▲▲▲ Log group end`
   
   but one will finish with
   
   ```
   [2025-02-03, 05:15:20 UTC] {pod.py:856} INFO - [base] logs: 05:15:20  
Completed successfully
   [2025-02-03, 05:15:20 UTC] {pod.py:856} INFO - [base] logs: 05:15:20  
   [2025-02-03, 05:15:20 UTC] {pod.py:856} INFO - [base] logs: 05:15:20  Done. 
PASS=193 WARN=0 ERROR=0 SKIP=0 TOTAL=193
   [2025-02-03, 05:15:22 UTC] {pod_manager.py:622} INFO - Pod 
dbt-cluid-stats-v7-experiment-nl1alqg7 has phase Running
   [2025-02-03, 05:15:24 UTC] {pod.py:988} INFO - Deleting pod: 
dbt-cluid-stats-v7-experiment-nl1alqg7
   [2025-02-03, 05:15:24 UTC] {taskinstance.py:3311} ERROR - Task failed with 
exception
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
 line 767, in _execute_task
       result = _execute_callable(context=context, **execute_callable_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
 line 733, in _execute_callable
       return ExecutionCallableRunner(
              ^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py",
 line 252, in run
       return self.func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py",
 line 1804, in resume_execution
       return execute_callable(context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/opt/code/id5/kubernetes/operator/pod.py", line 86, in 
trigger_reentry
       raise AirflowException(message)
   airflow.exceptions.AirflowException: Traceback (most recent call last):
     File "/opt/code/id5/kubernetes/trigger/pod.py", line 114, in run
       event = await self._wait_for_container_completion()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/triggers/pod.py",
 line 246, in _wait_for_container_completion
       pod = await self.hook.get_pod(self.pod_name, self.pod_namespace)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
 line 756, in get_pod
       pod: V1Pod = await v1_api.read_namespaced_pod(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
 line 192, in __call_api
       raise e
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
 line 185, in __call_api
       response_data = await self.request(
                       ^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
 line 195, in GET
       return (await self.request("GET", url,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
 line 189, in request
       raise ApiException(http_resp=r)
   kubernetes_asyncio.client.exceptions.ApiException: (404)
   Reason: Not Found
   HTTP response headers: <CIMultiDictProxy('Audit-Id': 
'00b7a901-68cc-407c-a1c7-1dddba7739bb', 'Cache-Control': 'no-cache, private', 
'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 
'83f2afc0-22a9-4eef-adaa-d9af4d5febf6', 'X-Kubernetes-Pf-Prioritylevel-Uid': 
'd0830089-6ed9-4917-90ac-dc395de94ce8', 'Date': 'Mon, 03 Feb 2025 05:07:37 
GMT', 'Content-Length': '244')>
   HTTP response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
 \"dbt-cluid-stats-v7-experiment-nl1alqg7\" not 
found","reason":"NotFound","details":{"name":"dbt-cluid-stats-v7-experiment-nl1alqg7","kind":"pods"},"code":404}
   [2025-02-03, 05:15:24 UTC] {taskinstance.py:1225} INFO - Marking task as 
UP_FOR_RETRY. dag_id=core-identity-cluid-stats-daily, 
task_id=dbt-cluid-stats-v7-experiment, 
run_id=dataset_triggered__2025-02-03T04:15:52.827850+00:00, map_index=115, 
execution_date=20250203T041552, start_date=20250203T050730, 
end_date=20250203T051524
   [2025-02-03, 05:15:24 UTC] {taskinstance.py:340} ▶ Post task execution logs
   ```
   
   The pod is being deleted while the triggerer is still checking it. We've 
tried any number of things to stop this, numbers of triggerers, scaling, 
dropping pgbouncer, even overloading `wait_for_container_completion` to handle 
it, which is possible, but then results in a follow on failure with 
`_wait_for_pod_start`
   
   `[2025-03-15, 04:14:01 UTC] {pod.py:865} INFO - [base] logs: 04:14:01  
Completed successfully
   [2025-03-15, 04:14:01 UTC] {pod.py:865} INFO - [base] logs: 04:14:01  
   [2025-03-15, 04:14:01 UTC] {pod.py:865} INFO - [base] logs: 04:14:01  Done. 
PASS=39 WARN=0 ERROR=0 SKIP=0 TOTAL=39
   [2025-03-15, 04:14:04 UTC] {pod_manager.py:654} INFO - Pod 
dbt-cluid-stats-v6-public-in22mzyx has phase Running
   [2025-03-15, 04:14:06 UTC] {pod.py:997} INFO - Deleting pod: 
dbt-cluid-stats-v6-public-in22mzyx
   [2025-03-15, 04:14:06 UTC] {taskinstance.py:3313} ERROR - Task failed with 
exception
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
 line 768, in _execute_task
       result = _execute_callable(context=context, **execute_callable_kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/taskinstance.py",
 line 734, in _execute_callable
       return ExecutionCallableRunner(
              ^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/utils/operator_helpers.py",
 line 252, in run
       return self.func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/models/baseoperator.py",
 line 1816, in resume_execution
       return execute_callable(context)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py",
 line 790, in trigger_reentry
       raise AirflowException(message)
   airflow.exceptions.AirflowException: Traceback (most recent call last):
     File "/opt/code/id5/kubernetes/trigger/pod.py", line 198, in run
       state = await self._wait_for_pod_start()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/opt/code/id5/kubernetes/trigger/pod.py", line 111, in 
_wait_for_pod_start
       pod = await self.hook.get_pod(self.pod_name, self.pod_namespace)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/cncf/kubernetes/hooks/kubernetes.py",
 line 821, in get_pod
       pod: V1Pod = await v1_api.read_namespaced_pod(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
 line 192, in __call_api
       raise e
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/api_client.py",
 line 185, in __call_api
       response_data = await self.request(
                       ^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
 line 195, in GET
       return (await self.request("GET", url,
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File 
"/home/airflow/.local/lib/python3.12/site-packages/kubernetes_asyncio/client/rest.py",
 line 189, in request
       raise ApiException(http_resp=r)
   kubernetes_asyncio.client.exceptions.ApiException: (404)
   Reason: Not Found
   HTTP response headers: <CIMultiDictProxy('Audit-Id': 
'efffc385-920d-401d-b74a-0612f652daa2', 'Cache-Control': 'no-cache, private', 
'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 
'83f2afc0-22a9-4eef-adaa-d9af4d5febf6', 'X-Kubernetes-Pf-Prioritylevel-Uid': 
'd0830089-6ed9-4917-90ac-dc395de94ce8', 'Date': 'Sat, 15 Mar 2025 04:11:06 
GMT', 'Content-Length': '236')>
   HTTP response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods
 \"dbt-cluid-stats-v6-public-in22mzyx\" not 
found","reason":"NotFound","details":{"name":"dbt-cluid-stats-v6-public-in22mzyx","kind":"pods"},"code":404}
   [2025-03-15, 04:14:06 UTC] {taskinstance.py:1226} INFO - Marking task as 
UP_FOR_RETRY. dag_id=core-identity-cluid-stats-daily, 
task_id=dbt-cluid-stats-v6-public, 
run_id=dataset_triggered__2025-03-15T04:09:28.081804+00:00, map_index=42, 
execution_date=20250315T040928, start_date=20250315T041037, 
end_date=20250315T041406
   [2025-03-15, 04:14:06 UTC] {taskinstance.py:341} ▶ Post task execution logs`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to