AutomationDev85 opened a new pull request, #60254:
URL: https://github.com/apache/airflow/pull/60254

   # Overview
   
   We observed intermittent hangs in Kubernetes API communication from 
KubernetesPodOperator, with calls stalling for over a day until tasks were 
manually stopped. Investigation showed the Kubernetes Python client wasn’t 
using a client-side timeout, so stalled connections could block indefinitely. 
To improve robustness, we add a client-side timeout to API calls so they raise 
a clear exception instead of leaving tasks hanging. This does not fix the 
underlying cluster/API issue, but it makes failures detectable and recoverable.
   
   We chose a 60-second timeout: long enough to tolerate load, short enough to 
prevent indefinite hangs. Timeouts are applied per call because there’s no 
clean, consistent way to set this at client creation across sync/async and 
watch/exec paths.
   
   # Change Summary
   
   * Set a 60-second client-side timeout for Kubernetes API requests.
   * Apply the timeout to individual API calls to ensure stalled calls fail 
fast.
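
A minimal sketch of the per-call approach described above, not the code in this PR. It assumes the Kubernetes Python client's generated API methods accept a `_request_timeout` keyword argument; the helper name `with_request_timeout` and the constant `API_TIMEOUT_SECONDS` are hypothetical and chosen only for illustration.

```python
import functools

# 60 seconds is the value chosen in the PR description; the constant name is hypothetical.
API_TIMEOUT_SECONDS = 60


def with_request_timeout(api_call, timeout=API_TIMEOUT_SECONDS):
    """Wrap an API method so each call carries a client-side timeout.

    Assumes the wrapped callable accepts a `_request_timeout` keyword
    argument. A timeout passed explicitly by the caller is left untouched.
    """

    @functools.wraps(api_call)
    def wrapper(*args, **kwargs):
        # Only fill in the default when the caller did not set a timeout.
        kwargs.setdefault("_request_timeout", timeout)
        return api_call(*args, **kwargs)

    return wrapper


# Hypothetical usage with a real client (requires the `kubernetes` package):
#   core_v1 = kubernetes.client.CoreV1Api()
#   read_pod = with_request_timeout(core_v1.read_namespaced_pod)
#   read_pod(name="my-pod", namespace="default")
```

Wrapping individual calls keeps the timeout explicit at each call site, which matches the per-call choice above; a caller that needs a different limit can still pass `_request_timeout` directly.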
   

