GitHub user JakeKandell edited a discussion: Multiple Airflow Schedulers 
Processing Same Dagruns?

Hi all, we run a large Airflow cluster and recently encountered some strange 
scheduling issues. We have a distributed Kubernetes setup with multiple 
scheduler pods and a MySQL DB on Airflow 2.9.2. 

In the scheduler logs, we see multiple schedulers processing the same finished 
dagrun simultaneously. They all try to mark this dagrun as failed.
```
POD: airflow-holdem-scheduler-6fb47f86fd-687gw
[2025-06-25T12:41:02.170+0000] {dagrun.py:822} ERROR - Marking run <DagRun 
hive_tracking_bmr_holdem__data_motion_kafka_airflow @ 2025-06-25 
11:00:00+00:00: scheduled__2025-06-25T11:00:00+00:00, state:running, queued_at: 
2025-06-25 11:51:22+00:00. externally triggered: False> failed
```
```
POD: airflow-holdem-scheduler-6fb47f86fd-nlj74
[2025-06-25T12:41:02.612+0000] {dagrun.py:822} ERROR - Marking run <DagRun 
hive_tracking_bmr_holdem__data_motion_kafka_airflow @ 2025-06-25 
11:00:00+00:00: scheduled__2025-06-25T11:00:00+00:00, state:running, queued_at: 
2025-06-25 11:51:22+00:00. externally triggered: False> failed
```

Furthermore, some of the schedulers report `active_runs=1`
```
[2025-06-25T12:41:02.986+0000] {scheduler_job_runner.py:1333} INFO - DAG 
hive_tracking_bmr_holdem__data_motion_kafka_airflow is at (or above) 
max_active_runs (1 of 1), not creating any more runs
```
Other schedulers try to set the `next_dagrun` (meaning `active_runs` was 
correctly identified as 0)
```
[2025-06-25T12:41:03.767+0000] {dag.py:3954} INFO - Setting next_dagrun for 
hive_tracking_bmr_holdem__data_motion_kafka_airflow to 2025-06-25 
12:00:00+00:00, run_after=2025-06-25 12:30:00+00:00
```

We believe this may have led to a race condition which caused 
`next_dagrun_create_after` to be set to NULL, preventing this DAG from being 
scheduled properly going forward.
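For anyone hitting the same symptom, here is a rough diagnostic sketch. It is not Airflow code: it just mimics the relevant columns of the `dag` metadata table in an in-memory SQLite database (the sample rows are made up) to show the query we used to spot DAGs whose `next_dagrun_create_after` had been NULLed.

```python
import sqlite3

# Hypothetical diagnostic sketch (not Airflow code): mimic the relevant
# columns of Airflow's `dag` table and look for rows where
# next_dagrun_create_after is NULL, which stops new runs from being created.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE dag (dag_id TEXT, next_dagrun TEXT, next_dagrun_create_after TEXT)"
)
db.execute(
    "INSERT INTO dag VALUES "
    "('healthy_dag', '2025-06-25 12:00:00', '2025-06-25 12:30:00'), "
    "('hive_tracking_bmr_holdem__data_motion_kafka_airflow', '2025-06-25 12:00:00', NULL)"
)

# A DAG in this state never gets a new run until the column is repopulated.
stuck = db.execute(
    "SELECT dag_id FROM dag WHERE next_dagrun_create_after IS NULL"
).fetchall()
print(stuck)
```

Against the real metadata DB the same `WHERE next_dagrun_create_after IS NULL` filter is what surfaced the affected DAG for us.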

Here are the [full 
logs](https://docs.google.com/spreadsheets/d/1fx6yd1F_h4FOw_y61hRG0eibw3bW6kGXiJSHoYb8y_Y/edit?usp=sharing) 
for this DAG.

From my review of [the 
code](https://github.com/apache/airflow/blob/2.9.2/airflow/jobs/scheduler_job_runner.py#L1047), 
I see that the dagrun rows are [supposed to be 
locked](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/dagrun.py#L551) 
when selected, so it's not clear to me how the same dagrun is being processed 
by different schedulers. I haven't been able to find a way to reproduce this in 
a test cluster.

Has anyone experienced this? Do we know how this can happen?

GitHub link: https://github.com/apache/airflow/discussions/53470
