jekeanyanwu opened a new pull request, #66716:
URL: https://github.com/apache/airflow/pull/66716

   <!-- closes: #ISSUE -->
   related: #66715
   
   ## Problem
   
   When `KubernetesPodOperator` runs in deferrable mode and the Kubernetes 
garbage collector deletes the pod in the window between the trigger firing a 
success/error/timeout event and the worker re-entering the task, 
`trigger_reentry` crashes.
   
   The unguarded `self.pod = self.hook.get_pod(pod_name, pod_namespace)` raises 
`ApiException(404)` and escapes `trigger_reentry`. On provider versions before 
#56976 (which added `if self.pod is None: return` to `_clean`), the `finally` 
block additionally crashes with `AttributeError: 'NoneType' object has no 
attribute 'metadata'`, masking the original cause.
   
   The existing dead-code branch right below the call:
   
   ```python
   self.pod = self.hook.get_pod(pod_name, pod_namespace)
   
   if not self.pod:
       raise PodNotFoundException("Could not find pod after resuming from 
deferral")
   ```
   
   was clearly intended to handle this, but `hook.get_pod()` raises rather than 
returning `None`, so the translation never happens.
   
   Real-world traceback (we hit this routinely on a Kubernetes cluster that 
aggressively reclaims completed pods):
   
   ```
   File ".../operators/pod.py", line 834, in trigger_reentry
       self.pod = self.hook.get_pod(pod_name, pod_namespace)
   kubernetes.client.exceptions.ApiException: (404) Not Found
   {"message":"pods \"load-chiba-lotte-marines-player-tracking-bq-t0x42m45\" 
not found", ...}
   
   During handling of the above exception, another exception occurred:
   ...
   File ".../operators/pod.py", line 905, in _clean
       self.pod = self.pod_manager.await_pod_completion(...)
   File ".../utils/pod_manager.py", line 808, in read_pod
       return self._client.read_namespaced_pod(pod.metadata.name, 
pod.metadata.namespace)
   AttributeError: 'NoneType' object has no attribute 'metadata'
   ```
   
   The trigger had already emitted `status: success` — the pod ran to 
completion successfully, was GC'd, and the worker resumed only to fail the task.
   
   ## Solution
   
   Wrap the `get_pod` call so that:
   
   - **Non-404 `ApiException`** re-raises unchanged.
   - **404 + `event["status"] == "success"`** logs a warning and returns. The 
trigger already observed the pod completed successfully; logs/XCom are 
unrecoverable but the task itself succeeded, so retrying is wrong.
   - **404 + non-success event** raises `PodNotFoundException`, matching the 
existing dead-code intent.
   
   The pre-existing `if not self.pod:` branch is kept as a defensive guard for 
any subclass override that returns `None` instead of raising.
   
   ## Tests
   
   Three new unit tests in `TestKubernetesPodOperatorAsync` covering the three 
branches:
   
   - `test_async_trigger_reentry_returns_when_pod_gcd_on_success`
   - `test_async_trigger_reentry_raises_pod_not_found_on_failure`
   - `test_async_trigger_reentry_propagates_non_404_api_exception`
   
   ```
   uv run --project providers/cncf/kubernetes pytest \
     
providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_pod.py::TestKubernetesPodOperatorAsync
 \
     -k trigger_reentry -q
   ```
   
   ## Related prior work
   
   - #39296 added `(HTTPError, ApiException)` handling around `_write_logs()` — 
runs after the unguarded `get_pod()`, doesn't help.
   - #56976 added `if self.pod is None: return` to `_clean` — runs after, 
doesn't help.
   - This PR closes the remaining gap at the unguarded `get_pod()` call itself.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to