AutomationDev85 opened a new pull request, #58397: URL: https://github.com/apache/airflow/pull/58397
# Overview

We are encountering frequent HTTP 429 "Too Many Requests" responses from the Kubernetes API when scaling up nodes in our Kubernetes cluster. A prior change (https://github.com/apache/airflow/pull/58033) introduced retry handling in the PodManager. However, we found that some PodManager methods bypass that logic by accessing the KubernetesHook API client directly. This PR therefore moves the primary retry mechanism into the KubernetesHook and adds targeted retries only to the PodManager methods that invoke the API client directly. It also refines the retry behavior so that only retry-worthy status codes and errors are retried.

We welcome your feedback on this change!

# Details of change:

* Retry logic centralized at the KubernetesHook level.
* PodManager now retries only for methods that directly invoke the Kubernetes API client.
* Retries limited to transient, retry-worthy status codes and network errors.
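To illustrate the retry policy described above, here is a minimal, standard-library-only sketch of a decorator that retries only transient failures. It is not the code from this PR: `ApiError`, `RETRYABLE_STATUS`, `is_retryable`, and `retry_transient` are hypothetical names, and the set of status codes, the attempt count, and the backoff values are assumptions chosen for illustration.

```python
import functools
import time


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes client error carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"API request failed with status {status}")
        self.status = status


# Assumption: which status codes count as transient is a policy choice.
RETRYABLE_STATUS = frozenset({429, 500, 502, 503, 504})


def is_retryable(exc):
    """Return True for transient API statuses and network-level errors."""
    if isinstance(exc, ApiError):
        return exc.status in RETRYABLE_STATUS
    return isinstance(exc, (ConnectionError, TimeoutError))


def retry_transient(max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry the wrapped call with exponential backoff on transient errors only."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-transient errors.
                    if attempt == max_attempts or not is_retryable(exc):
                        raise
                    # Delay doubles each attempt: base_delay, 2x, 4x, ...
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Applied at the hook level, a decorator like this would cover every call routed through the hook, while methods that call the API client directly would need to be wrapped individually. In the official Kubernetes Python client the analogous exception is `ApiException`, which exposes a `status` attribute; a non-retryable status such as 404 would be re-raised immediately under this sketch.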
