AutomationDev85 opened a new pull request, #58397:
URL: https://github.com/apache/airflow/pull/58397

   # Overview
   
   We are encountering "Too Many Requests" (HTTP 429) errors from the 
Kubernetes API when scaling up nodes in our Kubernetes cluster. We already 
introduced retry handling in the PodManager with this PR: 
https://github.com/apache/airflow/pull/58033
   However, we also found that the PodManager accesses the KubernetesHook 
API client directly, so the idea is to add the retry at the Hook level and 
wrap only the specific PodManager functions that access the API client 
directly with retry handling. 
   
   This PR also changes the retry handling so that only retry-worthy 
status codes and errors are retried.
   
   We are encountering frequent HTTP 429 “Too Many Requests” responses from the 
Kubernetes API during node scale-up operations. A prior change (see PR 
https://github.com/apache/airflow/pull/58033) introduced retry handling in the 
PodManager. But some PodManager methods bypassed that logic by using the 
KubernetesHook API client directly. This change moves the primary retry 
mechanism into the KubernetesHook and adds targeted retries only for PodManager 
methods that invoke the API client directly.
   
   Retry behavior is refined to act only on retry-worthy status codes and 
errors.
   
   We welcome your feedback on this change!
   
   # Details of change:
   
   * Retry logic centralized at the KubernetesHook level.
   * PodManager now retries only for methods that directly invoke the 
Kubernetes API client.
   * Retries limited to transient, retry-worthy status codes and network errors.
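
The retry policy described in the list above could be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `ApiError`, `is_retryable`, `retry_transient`, and the status-code set are hypothetical names and values chosen for the example, and the real change may use a different exception type, code list, and backoff strategy.

```python
import functools
import time

# Assumption: an illustrative set of transient status codes; the PR's actual list may differ.
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's API exception."""

    def __init__(self, status):
        super().__init__(f"API error with status {status}")
        self.status = status


def is_retryable(exc):
    """Return True only for transient API statuses and network-level errors."""
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUS_CODES
    return isinstance(exc, (ConnectionError, TimeoutError))


def retry_transient(max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Hypothetical decorator: retry transient failures with exponential backoff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-transient errors.
                    if attempt == max_attempts or not is_retryable(exc):
                        raise
                    # Wait base_delay, then 2x, 4x, ... before the next attempt.
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

In this sketch, a function decorated with `@retry_transient()` is retried on a 429 or a connection error, while a non-transient error such as a 404 is raised immediately on the first attempt.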
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
