dirrao commented on code in PR #36882:
URL: https://github.com/apache/airflow/pull/36882#discussion_r1463426729
##########
airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py:
##########
@@ -434,9 +434,9 @@ def sync(self) -> None:
)
self.fail(task[0], e)
except ApiException as e:
- # These codes indicate something is wrong with pod
definition; otherwise we assume pod
- # definition is ok, and that retrying may work
- if e.status in (400, 422):
+ # In case of the below error codes, fail the task and
honor the task retires.
+ # Otherwise, go for continuous/infinite retries.
+ if e.status in (400, 403, 404, 422):
Review Comment:
Ideally, the scheduler pool slots should be mapped to the namespace quota.
So, that we don't end up in quota exceeds exception. Even if they deviate over
time, the customer should be aware of the failures instead of masking/hiding
the failures. So, that the customer can adjust the pool slots as per the quota.
I believe this problem falls under resource-aware scheduling and we have to
think about it holistically across multiple executors.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]