NickYadance opened a new issue, #23664:
URL: https://github.com/apache/airflow/issues/23664

   ### Description
   
   The configuration option `max_active_runs_per_dag` defines the maximum number of active runs for a DAG. By default Airflow will spawn up to 16 DAG runs when it needs to. But once all 16 DAG runs have been spawned, they stop earlier DAG runs from retrying because there is no room left.
   A simple example: say I define a DAG with `catchup=True` and 
`depends_on_past=True`.
   ```python
   from datetime import timedelta

   import pendulum

   from airflow import DAG
   from airflow.operators.bash import BashOperator

   with DAG(
           'test',
           default_args={
               'depends_on_past': True,
               'retries': 0,
           },
           description='Max dagrun limitation should not stop failed dagrun from retry',
           schedule_interval=timedelta(hours=1),
           start_date=pendulum.datetime(2022, 5, 11, tz='Asia/Singapore'),
           catchup=True,
           tags=['example'],
   ) as dag:
       # 'error' is not a valid command, so this task always fails.
       task_error = BashOperator(task_id='error', bash_command='error')
   ```
   So Airflow spawns all 16 DAG runs for me:
   <img width="987" alt="image" 
src="https://user-images.githubusercontent.com/10060849/168016417-c2cd6185-cc29-43fe-922d-8fcb8e05077d.png";>
   After the first task fails, the other 15 DAG runs just sit there waiting for the first one to succeed (because of `depends_on_past`).
   But the first DAG run's retry does not work, as it stays queued with no room 
to run:
   <img width="984" alt="image" 
src="https://user-images.githubusercontent.com/10060849/168016932-a32a9e5e-9f9f-416d-95cd-365dfa21feb9.png";>
   
   A real-life example, when the DAG run queue is full:  
   1. Mark the latest DAG run as success to free up room.
   2. Clear the failed DAG run so it can retry.
   3. Clear the DAG run from step 1 so it reruns.
   
   After step 2, it can happen that another DAG run is kicked off and the room is full 
again. Then I have to mark the newest DAG run as success and rerun the one from 
step 1. In the worst case, this rerun loop just keeps going and cannot be 
stopped (a rough script for this workaround is sketched below).
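   The manual workaround above could be scripted roughly like this — a sketch only, assuming Airflow 2.2+ and access to the metadata DB through Airflow's ORM (`DAG_ID` refers to the example DAG above):
   ```python
   from airflow.models import DagBag, DagRun
   from airflow.utils.session import create_session
   from airflow.utils.state import DagRunState

   DAG_ID = 'test'  # the example DAG above

   with create_session() as session:
       runs = sorted(DagRun.find(dag_id=DAG_ID, session=session),
                     key=lambda dr: dr.execution_date)
       newest = runs[-1]
       newest_date = newest.execution_date
       failed_dates = [dr.execution_date for dr in runs
                       if dr.state == DagRunState.FAILED]
       # Step 1: mark the newest run success to free a slot.
       newest.state = DagRunState.SUCCESS
       session.merge(newest)

   dag = DagBag().get_dag(DAG_ID)
   # Step 2: clear the failed runs so their tasks can retry.
   for date in failed_dates:
       dag.clear(start_date=date, end_date=date)
   # Step 3: clear the newest run so it reruns afterwards.
   dag.clear(start_date=newest_date, end_date=newest_date)
   ```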
   
   ### Use case/motivation
   
   Maybe a DAG run that is retried by clearing its state should not count 
toward `max_active_runs_per_dag`, and such retried runs should have their own 
separate maximum.
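   Purely as an illustration of the proposed counting rule (all names below are made up for this sketch, not Airflow internals):
   ```python
   # Hypothetical scheduler check: runs that became active again because a user
   # cleared them are counted against their own (made-up) limit instead of
   # max_active_runs_per_dag.
   def can_start_run(active_runs: int, max_active_runs: int,
                     active_cleared_runs: int, max_cleared_runs: int,
                     is_cleared_rerun: bool) -> bool:
       if is_cleared_rerun:
           return active_cleared_runs < max_cleared_runs
       return active_runs < max_active_runs
   ```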
   
   
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

