johnhoran opened a new issue, #51789:
URL: https://github.com/apache/airflow/issues/51789

   ### Description
   
   If you run a task on the Celery executor and there aren't enough worker 
slots free to pick it up, the task simply stays in the queued state until a 
worker frees up and can take it.  However, if the task uses the 
KubernetesPodOperator and the cluster doesn't have enough capacity to 
accommodate the pod, Kubernetes returns an error and the task fails.
   
   ### Use case/motivation
   
   Ideally the task would remain queued until there are enough Kubernetes 
resources to accommodate it, but that feels like a massive change.  
   So instead I'd propose that the task catch this type of Kubernetes 
exception and go into a deferred state for a configurable amount of time, 
retrying until the pod is created.  In this scenario the time spent deferred 
would count against the task timeout, whereas time spent queued in Airflow 
doesn't, but I'd argue that is still better than outright task failure.  
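   To make the proposal concrete, here is a minimal sketch of the intended behaviour.  All names here are hypothetical stand-ins (`InsufficientClusterResources` for whatever Kubernetes error class signals "cannot schedule pod", `launch_with_deferred_retries` for the operator-side logic), and the blocking `sleep` stands in for what would really be a trigger-based deferral in Airflow; this is not the actual operator API:

   ```python
   import time


   class InsufficientClusterResources(Exception):
       """Hypothetical stand-in for the 'cluster cannot fit the pod' error."""


   def launch_with_deferred_retries(create_pod, *, defer_interval=60.0,
                                    task_timeout=3600.0,
                                    clock=time.monotonic, sleep=time.sleep):
       """Retry pod creation, 'deferring' between attempts.

       Time spent deferred counts against task_timeout, matching the
       trade-off described above: better a timeout than an immediate
       failure.  clock/sleep are injectable for testing.
       """
       deadline = clock() + task_timeout
       while True:
           try:
               # In the real operator this would be the pod-launch call.
               return create_pod()
           except InsufficientClusterResources:
               if clock() + defer_interval > deadline:
                   raise TimeoutError("task timeout reached while deferred")
               # In Airflow this would be self.defer(...) with a
               # time-based trigger rather than a blocking sleep.
               sleep(defer_interval)
   ```

   The key design point is that only the resource-exhaustion error is caught; any other pod failure still fails the task immediately.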
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
