rachthree commented on code in PR #53368:
URL: https://github.com/apache/airflow/pull/53368#discussion_r2624402228
##########
providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/job.py:
##########
@@ -461,7 +452,9 @@ def get_pods(
pod_list: Sequence[k8s.V1Pod] = []
retry_number: int = 0
- while len(pod_list) != self.parallelism or retry_number <= self.discover_pods_retry_number:
+ while retry_number <= self.discover_pods_retry_number:
Review Comment:
First off, thank you @stephen-bracken for looking into this!
My question here is about clusters that have limited resources and Kueue
enabled. The k8s Job will eventually spool up the pods, but that can take a
while when there is resource contention. In our case it's a GPU cluster, and
many jobs reserve GPUs with potentially long runtimes. Could a configurable
timeout be used here instead of a retry number? Or, drop the retry number /
timeout entirely and rely on the DAG / task timeout instead (which is what my
team preferred and is inherent in the solution I applied locally).
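To make the timeout idea concrete, here is a rough sketch of what the loop
could look like. This is only an illustration, not the exact implementation:
`discover_pods_timeout` and `discover_pods_poll_interval` are hypothetical
attributes, and `_list_job_pods()` stands in for whatever pod lookup the
operator already performs.
```python
import time

from airflow.exceptions import AirflowException


def get_pods(self) -> Sequence[k8s.V1Pod]:
    # Hypothetical sketch: poll until all expected pods are discovered,
    # or until a configurable wall-clock timeout elapses, instead of
    # counting a fixed number of retries.
    deadline = time.monotonic() + self.discover_pods_timeout  # hypothetical attribute
    pod_list: Sequence[k8s.V1Pod] = []
    while time.monotonic() < deadline:
        pod_list = self._list_job_pods()  # placeholder for the existing lookup logic
        if len(pod_list) == self.parallelism:
            return pod_list
        time.sleep(self.discover_pods_poll_interval)  # hypothetical attribute
    raise AirflowException(
        f"Discovered only {len(pod_list)} of {self.parallelism} pods before timeout"
    )
```
With the DAG / task timeout approach, the loop would instead block until
`len(pod_list) == self.parallelism` and let Airflow's own timeout handling
kill the task if the pods never appear.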
Thanks again!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]