aiqc opened a new issue, #38298: URL: https://github.com/apache/airflow/issues/38298
### Description I am running AWS Batch jobs using the BatchOperator. When I use Spot Instances, sometimes the job gets stuck in a "Runnable" status in the AWS Batch Queue. Probably because there are no spot capacity instances available. Eventually, the job will show `statusReason`:`UNDETERMINED - Batch job is blocked, root cause is undetermined.` Meanwhile, Airflow shows these stuck jobs in the green "running" state and keeps polling them. I have both `retries` and `retry_delay` defined, but I don't think they are being triggered because of this running state. ### Use case/motivation Cost savings -- I'd like a bit more retry functionality baked into the BatchOperator so that I can run spot instance jobs affordably ### Related issues _No response_ ### Are you willing to submit a PR? - [ ] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
