johnhoran commented on code in PR #61778:
URL: https://github.com/apache/airflow/pull/61778#discussion_r2821802209
##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/utils/pod_manager.py:
##########
@@ -188,19 +188,43 @@ def detect_pod_terminate_early_issues(pod: V1Pod) -> str | None:
     """
     Identify issues that justify terminating the pod early.
+    This method distinguishes between permanent failures (e.g., invalid image names)
+    and transient errors (e.g., rate limits) that should be retried by Kubernetes.
+
     :param pod: The pod object to check.
     :return: An error message if an issue is detected; otherwise, None.
     """
+    # Indicators in error messages that suggest transient issues
+    TRANSIENT_ERROR_PATTERNS = [
+        "pull qps exceeded",
+        "rate limit",
+        "too many requests",
+        "quota exceeded",
+        "temporarily unavailable",
+        "timeout",
+    ]
+
     pod_status = pod.status
     if pod_status.container_statuses:
         for container_status in pod_status.container_statuses:
             container_state: V1ContainerState = container_status.state
             container_waiting: V1ContainerStateWaiting | None = container_state.waiting
-            if container_waiting:
-                if container_waiting.reason in ["ErrImagePull", "ImagePullBackOff", "InvalidImageName"]:
+            if not container_waiting:
+                continue
+
+            if container_waiting.reason in ["InvalidImageName", "ErrImageNeverPull"]:
Review Comment:
It depends on why the registry is unavailable. For example, if the registry
simply doesn't exist then you'd get an error like:
```yaml
state:
  waiting:
    message: 'Error response from daemon: Get "https://blah:2000/v2/": dialing
      blah:2000 container via direct connection because disabled has no HTTPS
      proxy: connecting to blah:2000: dial tcp: lookup blah: no such host'
    reason: ErrImagePull
```
That doesn't match the keywords, so the task will fail immediately and
you'll get both the exception message containing that reason and, if it's
enabled, the logged manifest output from the pod to explain why it failed.
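For illustration, here's a rough sketch (not part of the PR) of how that message would be classified, reusing the `TRANSIENT_ERROR_PATTERNS` list from the diff above; the message string is abbreviated from the YAML snippet:
```python
# Keyword list copied from the patch above.
TRANSIENT_ERROR_PATTERNS = [
    "pull qps exceeded",
    "rate limit",
    "too many requests",
    "quota exceeded",
    "temporarily unavailable",
    "timeout",
]

# Abbreviated version of the ErrImagePull waiting message shown above.
message = (
    'Error response from daemon: Get "https://blah:2000/v2/": '
    "connecting to blah:2000: dial tcp: lookup blah: no such host"
)

is_transient = any(pattern in message.lower() for pattern in TRANSIENT_ERROR_PATTERNS)
print(is_transient)  # False -> treated as a permanent failure, so the task fails immediately
```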
If it's a more transient error, like going over some quota, then you would
have until the `startup_timeout` is exceeded. The triggerer would exit, the
operator does one last check to see if the pod has started, and if not, the
pod is killed for not starting in time. In this case you'd have to look
at the logged manifest output from the operator to understand why it failed.
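As a rough sketch of that transient path (assuming the usual `KubernetesPodOperator` parameters; the task, image, and values here are hypothetical):
```python
# Hedged sketch: if the image pull hits a transient error (rate limit, quota),
# Kubernetes keeps retrying the pull; the task only fails once the pod has not
# started within startup_timeout_seconds, after which you'd check the logged
# pod manifest from the operator to see why it never started.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

wait_for_image = KubernetesPodOperator(
    task_id="wait_for_image",                  # hypothetical task
    name="wait-for-image",
    image="registry.example.com/app:latest",   # hypothetical image
    startup_timeout_seconds=300,               # how long transient pull errors are tolerated
    deferrable=True,                           # the triggerer path described above
)
```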
I did think about adding a log message to this loop, but I was worried it
would get a little verbose. And usually when you are looking at logs in
Airflow, it's to try to understand a failure after it's happened, rather than
monitoring as it's happening, so the extra log message didn't seem to add much.
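For reference, the kind of per-container log line I had in mind (hypothetical, not in the PR) would be roughly:
```python
# Hypothetical sketch of the log line considered for the loop; "logger" is just
# a stand-in for whatever logger the module would use.
logger.info(
    "Container %s is waiting: reason=%s, message=%s",
    container_status.name,
    container_waiting.reason,
    container_waiting.message,
)
```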