rachthree commented on code in PR #53368:
URL: https://github.com/apache/airflow/pull/53368#discussion_r2624402228
##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py:
##########
@@ -461,7 +452,9 @@ def get_pods(
pod_list: Sequence[k8s.V1Pod] = []
retry_number: int = 0
- while len(pod_list) != self.parallelism or retry_number <= self.discover_pods_retry_number:
+ while retry_number <= self.discover_pods_retry_number:
Review Comment:
First off, thank you @stephen-bracken for looking into this!
My question here is about clusters that have limited resources and Kueue
enabled. The k8s Job will eventually spool up the pods, but that can take a
while when there is resource contention. In our case it's a GPU cluster, and
many jobs reserve GPUs with potentially long runtimes. Could a configurable
timeout be used here instead of a retry number? Or, drop the retry number /
timeout entirely and rely on the DAG / task timeout instead (which is what my
team preferred and is inherent in the solution I applied locally).
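To make the timeout idea concrete, here is a rough sketch of what the loop
could look like. This is only an illustration, not the exact implementation:
`discover_pods_timeout` and `discover_pods_poll_interval` are hypothetical
attributes, and `_list_job_pods()` stands in for whatever pod lookup the
operator already performs.
```python
import time

from airflow.exceptions import AirflowException


def get_pods(self) -> Sequence[k8s.V1Pod]:
    # Hypothetical sketch: poll until all expected pods are discovered,
    # or until a configurable wall-clock timeout elapses, instead of
    # counting a fixed number of retries.
    deadline = time.monotonic() + self.discover_pods_timeout  # hypothetical attribute
    pod_list: Sequence[k8s.V1Pod] = []
    while time.monotonic() < deadline:
        pod_list = self._list_job_pods()  # placeholder for the existing lookup logic
        if len(pod_list) == self.parallelism:
            return pod_list
        time.sleep(self.discover_pods_poll_interval)  # hypothetical attribute
    raise AirflowException(
        f"Discovered only {len(pod_list)} of {self.parallelism} pods before timeout"
    )
```
With the DAG / task timeout approach, the loop would instead block until
`len(pod_list) == self.parallelism` and let Airflow's own timeout handling
kill the task if the pods never appear.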
Thanks again!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]