johnhoran commented on code in PR #61778:
URL: https://github.com/apache/airflow/pull/61778#discussion_r2821802209


##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py:
##########
@@ -188,19 +188,43 @@ def detect_pod_terminate_early_issues(pod: V1Pod) -> str 
| None:
     """
     Identify issues that justify terminating the pod early.
 
+    This method distinguishes between permanent failures (e.g., invalid image 
names)
+    and transient errors (e.g., rate limits) that should be retried by 
Kubernetes.
+
     :param pod: The pod object to check.
     :return: An error message if an issue is detected; otherwise, None.
     """
+    # Indicators in error messages that suggest transient issues
+    TRANSIENT_ERROR_PATTERNS = [
+        "pull qps exceeded",
+        "rate limit",
+        "too many requests",
+        "quota exceeded",
+        "temporarily unavailable",
+        "timeout",
+    ]
+
     pod_status = pod.status
     if pod_status.container_statuses:
         for container_status in pod_status.container_statuses:
             container_state: V1ContainerState = container_status.state
             container_waiting: V1ContainerStateWaiting | None = 
container_state.waiting
-            if container_waiting:
-                if container_waiting.reason in ["ErrImagePull", 
"ImagePullBackOff", "InvalidImageName"]:
+            if not container_waiting:
+                continue
+
+            if container_waiting.reason in ["InvalidImageName", 
"ErrImageNeverPull"]:

Review Comment:
   It depends on why the registry is unavailable.  For example if the registry 
simply doesn't exist then you'd get an error like:
   ```yaml
       state:
         waiting:
           message: 'Error response from daemon: Get "https://blah:2000/v2/": 
dialing
             blah:2000 container via direct connection because disabled has no 
HTTPS
             proxy: connecting to blah:2000: dial tcp: lookup blah: no such 
host'
           reason: ErrImagePull
   ```
   That doesn't match the keywords, so the task will fail immediately and 
you'll get both the exception message containing that reason, and if its 
enabled the logged manifest output from the pod to explain why it failed.
   
   If its a more transient error, like going over some quota, then you would 
have until the `startup_timeout` exceeds.  The triggerer would exit, the 
operator does one last check to see if the pod has started, and if not then the 
pod will be killed for not starting in time.  In this case you'd have to look 
at the logged manifest output from the operator to understand why it failed.
   
   I did think about adding a log message to this loop, but I was worried it 
would get a little verbose.  And usually when you are looking at logs in 
airflow, its to try and understand the failure after its happened, rather than 
monitoring as its happening, so the extra log message didn't seem to add much.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to