SameerMesiah97 opened a new pull request, #60532:
URL: https://github.com/apache/airflow/pull/60532

   **Description**
   
   This change refactors `watch_pod_events` so that it continues watching 
events for the full lifecycle of the target pod, rather than stopping after a 
single watch stream terminates.
   
   The implementation now:
   
   - Reconnects automatically when a watch stream terminates (e.g. server-side 
timeout).
   - Resumes watching from the last observed `resourceVersion`.
   - Handles Kubernetes 410 Gone errors by restarting the watch from the 
current state.
   - Terminates cleanly when the pod completes or is deleted.
   
   This ensures that `watch_pod_events` continues yielding events for the full 
lifecycle of a pod instead of silently stopping after `timeout_seconds`.
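   
   For concreteness, the reconnect loop behaves roughly like the following
simplified sketch built on the official `kubernetes` Python client. The
function name `watch_pod_events_sketch` and its signature are illustrative
only, not the exact code in this PR (which also checks pod state in more
places than shown here):
   
   ```python
   from kubernetes import client, watch
   from kubernetes.client.exceptions import ApiException
   
   
   def watch_pod_events_sketch(api: client.CoreV1Api, pod_name: str,
                               namespace: str, timeout_seconds: int = 60):
       """Yield events for a pod until it is deleted or reaches a terminal phase."""
       resource_version = None
       while True:
           w = watch.Watch()
           try:
               # A single stream may be closed server-side after
               # timeout_seconds; resume from the last observed
               # resourceVersion on reconnect.
               for event in w.stream(
                   api.list_namespaced_event,
                   namespace=namespace,
                   field_selector=f"involvedObject.name={pod_name}",
                   resource_version=resource_version,
                   timeout_seconds=timeout_seconds,
               ):
                   resource_version = event["object"].metadata.resource_version
                   yield event
           except ApiException as e:
               if e.status == 410:  # 410 Gone: resourceVersion is too old
                   resource_version = None  # restart from current state
               else:
                   raise
           finally:
               w.stop()
   
           # Stop once the pod is deleted (404) or reaches a terminal phase.
           try:
               pod = api.read_namespaced_pod(name=pod_name, namespace=namespace)
           except ApiException as e:
               if e.status == 404:
                   return
               raise
           if pod.status.phase in ("Succeeded", "Failed"):
               return
   ```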
   
   **Rationale**
   
   The Kubernetes Watch API enforces server-side timeouts, so a single watch 
stream is not guaranteed to remain open indefinitely. The previous 
implementation treated `timeout_seconds` as an implicit upper bound on the 
total duration of event streaming, causing the generator to stop yielding 
events after the first watch termination, even while the pod was still running.
   
   This behavior is surprising and contradicts what users reasonably expect 
from the method name (`watch_pod_events`), the docstring, and standard 
Kubernetes watch semantics. The updated implementation aligns with Kubernetes 
best practices by treating watch termination as a recoverable condition and 
transparently reconnecting until the pod reaches a terminal lifecycle state.
   
   **Backwards Compatibility**
   
   This change does **not** alter the public API or method signature. However, 
it does change runtime behavior:
   
   - `timeout_seconds` now applies only to individual watch connections, not 
to the overall duration of event streaming.
   - Event streaming continues until pod completion or deletion instead of 
stopping silently after a timeout.
   
   While it is possible that some users rely on the previous behavior, it is 
more likely that existing deployments have implemented workarounds (e.g. 
external loops or polling) to compensate for the premature termination. The new 
behavior is consistent with documented intent and Kubernetes conventions, and 
therefore adheres to the principle of least surprise.
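   
   To make the compatibility impact concrete, here is how a caller observes 
the new semantics, expressed against the hypothetical 
`watch_pod_events_sketch` generator above rather than the provider's actual 
API (it assumes a reachable cluster and a local kubeconfig):
   
   ```python
   from kubernetes import client, config
   
   config.load_kube_config()  # assumption: local kubeconfig points at a cluster
   api = client.CoreV1Api()
   
   # timeout_seconds bounds each underlying watch connection only; the
   # generator keeps yielding across reconnects until the pod completes
   # or is deleted.
   for event in watch_pod_events_sketch(
       api, "mypod", "default", timeout_seconds=60
   ):
       print(event["object"].reason, event["object"].message)
   ```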
   
   **Tests**
   
   Added unit tests to validate the following expected behaviors:
   
   - Reconnects and continues streaming events after a watch stream ends (e.g. 
timeout).
   - Restarts the watch when Kubernetes returns 410 Gone due to a stale 
`resourceVersion`.
   - Stops cleanly when the pod is deleted (404).
   - Stops immediately when the pod reaches a terminal phase (Succeeded or 
Failed).
   
   Existing tests have been updated to account for the addition of pod state 
inspection in `watch_pod_events`.
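   
   As an illustration of the testing approach, here is a pytest-style sketch 
against the hypothetical `watch_pod_events_sketch` above, not the PR's actual 
tests:
   
   ```python
   from unittest import mock
   
   from kubernetes.client.exceptions import ApiException
   
   
   def test_stops_when_pod_deleted():
       api = mock.MagicMock()
       with mock.patch("kubernetes.watch.Watch") as watch_cls:
           # The first stream ends normally (server-side timeout, no events)...
           watch_cls.return_value.stream.return_value = iter([])
           # ...and the follow-up pod lookup raises 404: the pod is gone.
           api.read_namespaced_pod.side_effect = ApiException(status=404)
           events = list(watch_pod_events_sketch(api, "mypod", "default"))
       assert events == []
   ```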
   
   **Notes**
   
   - `_load_config` is now cached and is responsible only for loading 
configuration; it no longer returns an API client. API client instantiation is 
now solely the responsibility of `get_conn`, enabling reconnection in 
`watch_pod_events` without redundant configuration reloads (see the sketch 
after this list). The internal helper used to construct and return an API 
client from `_load_config` has been removed.
   - The exception message raised when multiple configuration sources are 
supplied has been clarified to more accurately describe the error.
   - Polling fallback behavior is preserved and now continues until the pod 
reaches a terminal lifecycle state, matching the updated watch semantics.
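   
   A minimal sketch of the `_load_config`/`get_conn` split described in the 
first note; the class name and method bodies are illustrative assumptions, not 
the hook's actual code:
   
   ```python
   from functools import cache
   
   from kubernetes import client, config
   
   
   class HookSketch:
       @cache
       def _load_config(self) -> None:
           # Cached: configuration is loaded at most once per instance, and
           # no API client is constructed or returned here.
           config.load_kube_config()
   
       def get_conn(self) -> client.ApiClient:
           # The single place API clients are created, so watch_pod_events
           # can get a fresh client on reconnect without reloading config.
           self._load_config()
           return client.ApiClient()
   ```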
   
   Closes: #60495

