dhuang opened a new issue #17638:
URL: https://github.com/apache/airflow/issues/17638


   **Apache Airflow version**: 2.1.2.
   
   **OS**: Debian.
   
   **Apache Airflow Provider versions**: Probably not relevant.
   
   **Deployment**: Single scheduler instance, since we're on MySQL 5.7.
   
   **What happened**: 
   Since updating from 1.10.15 to 2.1.2, we started noticing that a small subset of DAGs would no longer get new DAG Runs scheduled (roughly 30 of ~5000 DAGs), while the rest worked perfectly fine. We've been able to trigger manual runs for these DAGs with no issues, and found no other errors or warnings in any logs. When restarting the scheduler, we'd sometimes see the next interval get scheduled, but the DAG would then get stuck again after that first new run. 
   
   After some investigation, I noticed that these stuck DAGs share a few common attributes: their `next_dagrun_create_after` is `NULL` in the metadata DB, they are DAGs we set to `max_active_runs=1`, and they more often run on shorter intervals (every 5-15 minutes, though sometimes still daily). These DAGs are otherwise all different and are a mix of static and dynamically generated ones.
   
   **What you expected to happen**: 
   Digging into the new scheduler logic, I saw that `next_dagrun_create_after` gets set to `NULL` when `max_active_runs` is reached, in https://github.com/apache/airflow/blob/2.1.2/airflow/models/dag.py#L2304. The filter in https://github.com/apache/airflow/blob/2.1.2/airflow/models/dag.py#L2276 would then prevent the DAG from being considered for scheduling again until `next_dagrun_create_after` is set back to a non-null value.
   
   I believe that reset is supposed to happen in https://github.com/apache/airflow/blob/2.1.2/airflow/models/dag.py#L229, but `next_dagrun_create_after` remains stuck at `NULL` even after all pending runs are complete. By querying the database directly (see the sketch below), I verified that `max_active_runs` is indeed no longer reached once the prior DAG run finishes, and I saw no "DAG %s is at (or above) max_active_runs (%d of %d), not creating any more runs" log message. If I update `next_dagrun_create_after` manually, a run is scheduled right away, but the DAG then gets stuck again after that.
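   
   For reference, this is roughly the check we performed, sketched here with Airflow's ORM models rather than the raw MySQL queries we actually ran (model and column names per Airflow 2.1.x; the snippet is only illustrative):
   
```python
from airflow.models import DagModel, DagRun
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    # DAGs the scheduler will no longer consider: next_dagrun_create_after is NULL.
    stuck = (
        session.query(DagModel)
        .filter(DagModel.next_dagrun_create_after.is_(None))
        .filter(DagModel.is_paused.is_(False))
        .all()
    )
    for dag_model in stuck:
        # Confirm max_active_runs is not actually reached by counting running DAG runs.
        running = (
            session.query(DagRun)
            .filter(DagRun.dag_id == dag_model.dag_id, DagRun.state == State.RUNNING)
            .count()
        )
        print(f"{dag_model.dag_id}: next_dagrun_create_after=NULL, running runs={running}")
```
   
   In our case, the stuck DAGs showed zero running DAG runs but still had `next_dagrun_create_after=NULL`.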
   
   We can work around this by removing `max_active_runs`, which returns all scheduling to normal, but that obviously also drops the desired cap.
   
   **How to reproduce it**: The shortest way is probably to create a DAG with `max_active_runs=1`, `schedule_interval="0 */1 * * *"`, and a `BashOperator` task that sleeps for 5 minutes; a minimal sketch follows.
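   
   Written out as a DAG file, a minimal sketch of that reproduction (DAG id, start date, and task id are arbitrary placeholders):
   
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="max_active_runs_stuck_repro",  # placeholder name
    start_date=datetime(2021, 8, 1),       # placeholder start date
    schedule_interval="0 */1 * * *",
    max_active_runs=1,
    catchup=False,
) as dag:
    # Runs long enough for the scheduler to hit max_active_runs while the run is
    # still active, which is when next_dagrun_create_after gets set to NULL.
    BashOperator(
        task_id="sleep_5_minutes",
        bash_command="sleep 300",
    )
```
   
   After the first scheduled run finishes, `next_dagrun_create_after` stays `NULL` and no further runs are created.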
   
   **Anything else we need to know**: Nothing else comes to mind.
   
   **Are you willing to submit a PR?** Yes.
   

