jekeanyanwu opened a new issue, #66715:
URL: https://github.com/apache/airflow/issues/66715

   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   `apache-airflow-providers-cncf-kubernetes` (reproduced against current 
`main`; same code path exists at least back to 10.x).
   
   ### Apache Airflow version
   
   Reproduced on 2.x with provider 10.x; the buggy code path also exists on 
`main`.
   
   ### What happened
   
   When `KubernetesPodOperator` runs in deferrable mode and the Kubernetes 
garbage collector deletes the pod in the window between the trigger firing a 
success/error/timeout event and the worker re-entering the task, 
`trigger_reentry` crashes.
   
   The unguarded `self.pod = self.hook.get_pod(pod_name, pod_namespace)` call 
raises `kubernetes.client.exceptions.ApiException: (404)` and that exception 
escapes `trigger_reentry`. On older provider versions where `_clean` does not 
yet guard `self.pod is None`, the `finally` block additionally crashes with 
`AttributeError: 'NoneType' object has no attribute 'metadata'`, masking the 
original cause. Either way, a task whose pod actually **completed 
successfully** is marked failed and retried.
   
   Real-world traceback we hit (one of many; the cluster reclaims completed 
pods aggressively):
   
   ```
   File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 834, in 
trigger_reentry
       self.pod = self.hook.get_pod(pod_name, pod_namespace)
   ...
   kubernetes.client.exceptions.ApiException: (404)
   Reason: Not Found
   HTTP response body: 
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
     "message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\" 
not found",
     "reason":"NotFound","details":{"name":"...","kind":"pods"},"code":404}
   
   During handling of the above exception, another exception occurred:
   
   File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 895, in 
trigger_reentry
       self._clean(event, context)
   File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 905, in 
_clean
       self.pod = self.pod_manager.await_pod_completion(
   File ".../airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 808, 
in read_pod
       return self._client.read_namespaced_pod(pod.metadata.name, 
pod.metadata.namespace)
   AttributeError: 'NoneType' object has no attribute 'metadata'
   ```
   
   The trigger had already emitted `status: success` — the pod ran to 
completion, was GC'd, and then the worker resumed and tried to fetch it.
   
   ### What you think should happen instead
   
   The author's clear intent in the existing code is to translate a missing pod 
into the operator-level `PodNotFoundException`:
   
   ```python
   self.pod = self.hook.get_pod(pod_name, pod_namespace)
   
   if not self.pod:
       raise PodNotFoundException("Could not find pod after resuming from 
deferral")
   ```
   
   But `hook.get_pod()` never returns `None` — it raises `ApiException(404)`, 
so the `if not self.pod` branch is dead code. The 404 escapes uncaught.
   
   PR #39296 added `(HTTPError, ApiException)` handling around `_write_logs()`, 
and PR #56976 added a `self.pod is None` early-return in `_clean`. Neither of 
those help here: both run *after* the unguarded `get_pod()` call.
   
   Proposed fix: wrap the `get_pod()` call in `try/except ApiException` and:
   - On non-404, re-raise unchanged.
   - On 404 + `event["status"] == "success"`: log a warning and return (the pod 
already completed successfully per the trigger — logs/XCom are unrecoverable 
but the task succeeded).
   - On 404 + non-success event: raise `PodNotFoundException` (matches the 
existing dead-code intent).
   
   I have a draft PR with the fix and unit tests linked below.
   
   ### How to reproduce
   
   Run any `KubernetesPodOperator` with `deferrable=True` against a cluster 
that GCs pods aggressively (or just `kubectl delete pod` the running pod 
between the trigger firing `success` and the worker re-entering). The worker 
resumes, `get_pod` returns 404, and the task fails with the traceback above. We 
see this routinely under GKE Autopilot when daemonsets preempt nodes hosting 
just-completed task pods.
   
   ### Anything else
   
   Affects any deployment that uses `KubernetesPodOperator` in deferrable mode 
with a cluster that aggressively reclaims completed pods (GKE Autopilot, EKS 
with node-pressure eviction, preemption-by-priority-class). Workaround until 
this is fixed is a `SafeKubernetesPodOperator` subclass that catches 
`ApiException(404)` (and `AttributeError` on older providers) in 
`trigger_reentry`.
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR! (draft PR linked in a comment once 
opened.)
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to