jscheffl commented on code in PR #58033:
URL: https://github.com/apache/airflow/pull/58033#discussion_r2505634128


##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py:
##########
@@ -78,7 +78,8 @@ class PodLaunchFailedException(AirflowException):
 def should_retry_start_pod(exception: BaseException) -> bool:
     """Check if an Exception indicates a transient error and warrants retrying."""
     if isinstance(exception, ApiException):
-        return str(exception.status) == "409"
+        # Retry on status code 409 (Conflict) or 429 (Too Many Requests)
+        return str(exception.status) in {"409", "429"}

Review Comment:
   Without having read all the code: an HTTP 429 response usually carries a 
header (typically `Retry-After`) asking the client to throttle, i.e. to sleep 
for a period of time. "Just" retrying will most likely produce another 429 and 
thus a higher failure rate.
   
   As 429 might occur anywhere in the API, can this be handled in general? 
In the Edge Worker API we use, for example, the library retryhttp, which, like 
tenacity, provides a decorator with a standard set of well-known retries.
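
   To illustrate the concern, here is a minimal sketch (not the PR's code and 
not retryhttp's API) of choosing the wait time from the `Retry-After` header 
when a 429 arrives, and falling back to jittered exponential backoff otherwise. 
`FakeApiException`, `retry_after_seconds` and `wait_before_retry` are 
hypothetical names; the stand-in class only assumes that the real exception 
exposes `.status` and `.headers`. Only the delta-seconds form of the header is 
handled.

```python
from __future__ import annotations

import random
from typing import Mapping, Optional


class FakeApiException(Exception):
    """Hypothetical stand-in for the Kubernetes client's ApiException.

    Assumption: only the `.status` and `.headers` attributes are needed.
    """

    def __init__(self, status: int, headers: Optional[Mapping[str, str]] = None):
        super().__init__(f"status={status}")
        self.status = status
        self.headers = headers or {}


def retry_after_seconds(exception: BaseException) -> Optional[float]:
    """Return the delay requested via Retry-After (seconds form), else None."""
    headers = getattr(exception, "headers", None) or {}
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except (TypeError, ValueError):
                # The HTTP-date form of Retry-After is not handled in this sketch.
                return None
    return None


def wait_before_retry(
    exception: BaseException, attempt: int, base: float = 1.0, cap: float = 30.0
) -> float:
    """Return how many seconds to sleep before the next attempt."""
    if str(getattr(exception, "status", "")) == "429":
        hinted = retry_after_seconds(exception)
        if hinted is not None:
            # Honor the server's hint, but never wait longer than the cap.
            return min(hinted, cap)
    # No usable hint: exponential backoff with full jitter.
    return random.uniform(0, min(cap, base * 2**attempt))
```

   With something like this, a 429 without a usable hint still backs off 
instead of immediately hitting the API again, and a 409 keeps a short, 
randomized delay.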


