dhuang opened a new issue #17638: URL: https://github.com/apache/airflow/issues/17638
**Apache Airflow version**: 2.1.2
**OS**: Debian
**Apache Airflow Provider versions**: Probably not relevant.
**Deployment**: Single scheduler instance, since we're on MySQL 5.7.

**What happened**:

Since updating from 1.10.15 to 2.1.2, we started noticing that a small subset of DAGs would no longer get new DAG runs scheduled (roughly 30 of ~5,000 DAGs), while the rest worked perfectly fine. We were able to trigger manual runs of these DAGs without issue and found no other errors or warnings in any logs. When restarting the scheduler, we would sometimes see the next interval get scheduled, but the DAG would then get stuck again after that first new run.

After some investigation, I noticed the stuck DAGs had some attributes in common: their `next_dagrun_create_after` was `NULL`, they were DAGs we had set to `max_active_runs=1`, and they tended to be on shorter intervals (every 5-15 minutes, though sometimes daily). These DAGs are otherwise all different and are a mix of static/dynamic.

**What you expected to happen**:

Digging into the new scheduler logic, I saw that `next_dagrun_create_after` is set to `NULL` when `max_active_runs` is reached in https://github.com/apache/airflow/blob/2.1.2/airflow/models/dag.py#L2304. The filter in https://github.com/apache/airflow/blob/2.1.2/airflow/models/dag.py#L2276 would then prevent the DAG from being considered for scheduling again until `next_dagrun_create_after` is set back to a non-null value. I believe that is supposed to happen in https://github.com/apache/airflow/blob/2.1.2/airflow/models/dag.py#L229, but `next_dagrun_create_after` remains stuck at `NULL` even after all pending runs are complete. By querying the database directly, I verified that `max_active_runs` is indeed not met once the prior DAG run finishes, and I saw no "DAG %s is at (or above) max_active_runs (%d of %d), not creating any more runs" messages. If I update `next_dagrun_create_after` manually, a run is scheduled right away, but the DAG gets stuck again after that.
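For illustration, the effect of the `NULL` filter described above can be modeled in plain Python. This is a simplified sketch, not the actual scheduler code; the dict-based `DagModel` stand-ins and the `dags_needing_dagruns` helper name are hypothetical:

```python
from datetime import datetime

# Simplified stand-ins for DagModel rows; field names mirror the real
# columns, but this is only an illustration of the filtering behavior.
dags = [
    {"dag_id": "healthy", "next_dagrun_create_after": datetime(2021, 8, 1)},
    {"dag_id": "stuck", "next_dagrun_create_after": None},  # NULLed at max_active_runs
]

def dags_needing_dagruns(dags, now):
    # Mirrors the filter at dag.py#L2276: a NULL next_dagrun_create_after
    # can never satisfy "<= now", so the DAG is skipped indefinitely unless
    # something writes the column back to a real timestamp.
    return [
        d["dag_id"]
        for d in dags
        if d["next_dagrun_create_after"] is not None
        and d["next_dagrun_create_after"] <= now
    ]

print(dags_needing_dagruns(dags, datetime(2021, 8, 2)))  # ['healthy']
```

Once the "stuck" row's timestamp is `NULL`, only a write back to the column (which should happen around dag.py#L229) can make it schedulable again.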
We can work around this by removing `max_active_runs`, which returns all scheduling to normal, but obviously also removes the desired cap.

**How to reproduce it**:

The shortest way is probably to create a DAG with `max_active_runs=1`, `schedule_interval="0 */1 * * *"`, and a `BashOperator` task that sleeps for 5 minutes.

**Anything else we need to know**: Nothing else in mind.

**Are you willing to submit a PR?**: Yes.
