collinmcnulty opened a new issue, #49971: URL: https://github.com/apache/airflow/issues/49971
### Apache Airflow version 2.10.5 ### If "Other Airflow 2 version" selected, which one? _No response_ ### What happened? A reschedule sensor was running for many hours. Over the course of that period, three of the "attempts" failed to go from queued to running in time. The task was then marked as failed by the requeue system (which bypasses retries). ### What you think should happen instead? When a reschedule sensor makes it to running, the requeue count should "reset". I only care if it failed to queue->running three times _in a row_. The current system assumes that a single try will only go from queued to running once, but reschedule sensors break that assumption. ### How to reproduce 1. set a reschedule sensor to run for a long time, that will never have its condition met 2. use celery executor 3. remove all workers in the task's queue for a brief period, so it fails to get to running 4. return the workers, allowing it to get to running 5. repeat 3 and 4 two more times ### Operating System debian bookworm ### Versions of Apache Airflow Providers Was using external task sensor in reschedule mode. ### Deployment Astronomer ### Deployment details _No response_ ### Anything else? I believe the right fix for this is in `_get_num_times_stuck_in_queued` in scheduler_job_runner.py First check if there are any `running` events, then if there are, filter the existing query to datetimes after the most recent `running` event. ### Are you willing to submit PR? - [ ] Yes I am willing to submit a PR! ### Code of Conduct - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
