jekeanyanwu opened a new issue, #66715:
URL: https://github.com/apache/airflow/issues/66715
### Apache Airflow Provider(s)
cncf-kubernetes
### Versions of Apache Airflow Providers
`apache-airflow-providers-cncf-kubernetes` (reproduced against current
`main`; same code path exists at least back to 10.x).
### Apache Airflow version
Reproduced on 2.x with provider 10.x; the buggy code path also exists on
`main`.
### What happened
When `KubernetesPodOperator` runs in deferrable mode and the Kubernetes
garbage collector deletes the pod in the window between the trigger firing a
success/error/timeout event and the worker re-entering the task,
`trigger_reentry` crashes.
The unguarded `self.pod = self.hook.get_pod(pod_name, pod_namespace)` call
raises `kubernetes.client.exceptions.ApiException: (404)` and that exception
escapes `trigger_reentry`. On older provider versions where `_clean` does not
yet guard `self.pod is None`, the `finally` block additionally crashes with
`AttributeError: 'NoneType' object has no attribute 'metadata'`, masking the
original cause. Either way, a task whose pod actually **completed
successfully** is marked failed and retried.
Real-world traceback we hit (one of many; the cluster reclaims completed
pods aggressively):
```
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 834, in
trigger_reentry
self.pod = self.hook.get_pod(pod_name, pod_namespace)
...
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response body:
{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure",
"message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\"
not found",
"reason":"NotFound","details":{"name":"...","kind":"pods"},"code":404}
During handling of the above exception, another exception occurred:
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 895, in
trigger_reentry
self._clean(event, context)
File ".../airflow/providers/cncf/kubernetes/operators/pod.py", line 905, in
_clean
self.pod = self.pod_manager.await_pod_completion(
File ".../airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 808,
in read_pod
return self._client.read_namespaced_pod(pod.metadata.name,
pod.metadata.namespace)
AttributeError: 'NoneType' object has no attribute 'metadata'
```
The trigger had already emitted `status: success` — the pod ran to
completion, was GC'd, and then the worker resumed and tried to fetch it.
### What you think should happen instead
The author's clear intent in the existing code is to translate a missing pod
into the operator-level `PodNotFoundException`:
```python
self.pod = self.hook.get_pod(pod_name, pod_namespace)
if not self.pod:
raise PodNotFoundException("Could not find pod after resuming from
deferral")
```
But `hook.get_pod()` never returns `None` — it raises `ApiException(404)`,
so the `if not self.pod` branch is dead code. The 404 escapes uncaught.
PR #39296 added `(HTTPError, ApiException)` handling around `_write_logs()`,
and PR #56976 added a `self.pod is None` early-return in `_clean`. Neither of
those help here: both run *after* the unguarded `get_pod()` call.
Proposed fix: wrap the `get_pod()` call in `try/except ApiException` and:
- On non-404, re-raise unchanged.
- On 404 + `event["status"] == "success"`: log a warning and return (the pod
already completed successfully per the trigger — logs/XCom are unrecoverable
but the task succeeded).
- On 404 + non-success event: raise `PodNotFoundException` (matches the
existing dead-code intent).
I have a draft PR with the fix and unit tests linked below.
### How to reproduce
Run any `KubernetesPodOperator` with `deferrable=True` against a cluster
that GCs pods aggressively (or just `kubectl delete pod` the running pod
between the trigger firing `success` and the worker re-entering). The worker
resumes, `get_pod` returns 404, and the task fails with the traceback above. We
see this routinely under GKE Autopilot when daemonsets preempt nodes hosting
just-completed task pods.
### Anything else
Affects any deployment that uses `KubernetesPodOperator` in deferrable mode
with a cluster that aggressively reclaims completed pods (GKE Autopilot, EKS
with node-pressure eviction, preemption-by-priority-class). Workaround until
this is fixed is a `SafeKubernetesPodOperator` subclass that catches
`ApiException(404)` (and `AttributeError` on older providers) in
`trigger_reentry`.
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR! (draft PR linked in a comment once
opened.)
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]