GitHub user JakeKandell edited a discussion: Multiple Airflow Schedulers Processing Same Dagruns?
Hi all, we run a large Airflow cluster and recently encountered some strange
scheduling issues. We have a distributed Kubernetes setup with multiple
scheduler pods and a MySQL DB on Airflow 2.9.2.
In the scheduler logs, we see multiple schedulers processing the same finished dagrun simultaneously. They all try to mark this dagrun as failed.
```
POD: airflow-holdem-scheduler-6fb47f86fd-687gw
[2025-06-25T12:41:02.170+0000] {dagrun.py:822} ERROR - Marking run <DagRun
hive_tracking_bmr_holdem__data_motion_kafka_airflow @ 2025-06-25
11:00:00+00:00: scheduled__2025-06-25T11:00:00+00:00, state:running, queued_at:
2025-06-25 11:51:22+00:00. externally triggered: False> failed
```
```
POD: airflow-holdem-scheduler-6fb47f86fd-nlj74
[2025-06-25T12:41:02.612+0000] {dagrun.py:822} ERROR - Marking run <DagRun
hive_tracking_bmr_holdem__data_motion_kafka_airflow @ 2025-06-25
11:00:00+00:00: scheduled__2025-06-25T11:00:00+00:00, state:running, queued_at:
2025-06-25 11:51:22+00:00. externally triggered: False> failed
```
Furthermore, some of the schedulers report `active_runs=1`
```
[2025-06-25T12:41:02.986+0000] {scheduler_job_runner.py:1333} INFO - DAG
hive_tracking_bmr_holdem__data_motion_kafka_airflow is at (or above)
max_active_runs (1 of 1), not creating any more runs
```
Other schedulers try to set the `next_dagrun` (meaning `active_runs` was
correctly identified as 0)
```
[2025-06-25T12:41:03.767+0000] {dag.py:3954} INFO - Setting next_dagrun for
hive_tracking_bmr_holdem__data_motion_kafka_airflow to 2025-06-25
12:00:00+00:00, run_after=2025-06-25 12:30:00+00:00
```
We believe this may have led to a race condition that caused
`next_dagrun_create_after` to be set to NULL, preventing this DAG from being
properly scheduled going forward.
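As a quick sanity check for anyone hitting the same thing, a query along these lines can confirm whether `next_dagrun_create_after` ended up NULL for the affected DAG. This is only a sketch against an in-memory SQLite stand-in (our real metadata DB is MySQL); the `dag_id`, `next_dagrun`, and `next_dagrun_create_after` columns are from the Airflow 2.x `dag` table:

```python
import sqlite3

# Stand-in for the Airflow metadata DB; the real cluster uses MySQL,
# but the `dag` table columns queried below exist in Airflow 2.x.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dag (dag_id TEXT PRIMARY KEY, "
    "next_dagrun TEXT, next_dagrun_create_after TEXT)"
)
# Simulate the broken state we observed: create_after wiped to NULL.
conn.execute(
    "INSERT INTO dag VALUES "
    "('hive_tracking_bmr_holdem__data_motion_kafka_airflow', "
    "'2025-06-25 12:00:00+00:00', NULL)"
)

# The actual check: a scheduled DAG whose next_dagrun_create_after is
# NULL never becomes eligible for a new run.
stuck = conn.execute(
    "SELECT dag_id FROM dag WHERE next_dagrun_create_after IS NULL"
).fetchall()
print(stuck)
```

In our case this is exactly the row that stopped being scheduled.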
Here are the [full
logs](https://docs.google.com/spreadsheets/d/1fx6yd1F_h4FOw_y61hRG0eibw3bW6kGXiJSHoYb8y_Y/edit?usp=sharing)
for this DAG.
From my review of [the
code](https://github.com/apache/airflow/blob/2.9.2/airflow/jobs/scheduler_job_runner.py#L1047),
I see that the dagrun rows are [supposed to be
locked](https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/dagrun.py#L551)
when selected, so it's not clear to me how the same dagrun is being processed
by different schedulers. I haven't been able to find a way to reproduce this in
a test cluster.
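For context, the locking in question is `SELECT ... FOR UPDATE SKIP LOCKED` on the dagrun rows. Its intended behavior can be sketched with a plain threading analogy (hypothetical names, not Airflow code): a scheduler whose non-blocking acquire fails is supposed to skip the row entirely, so only one scheduler should ever mark a given dagrun failed.

```python
import threading

# Hypothetical stand-in: this lock plays the role of the row lock that
# SELECT ... FOR UPDATE SKIP LOCKED takes on a dagrun row.
dagrun_row_lock = threading.Lock()

def try_process_dagrun(scheduler_name: str) -> bool:
    """Return True if this scheduler got the row and processed it.

    SKIP LOCKED semantics: a failed non-blocking acquire means another
    scheduler already holds the row, so we skip it instead of waiting.
    """
    if not dagrun_row_lock.acquire(blocking=False):
        return False
    # ... inside the transaction: evaluate the run, mark it failed ...
    # The lock is deliberately not released here, to model scheduler A
    # still being mid-transaction when scheduler B comes along.
    return True

# Scheduler A picks up the dagrun first and is still mid-transaction...
a_won = try_process_dagrun("scheduler-687gw")
# ...so scheduler B's SELECT should skip the locked row entirely.
b_won = try_process_dagrun("scheduler-nlj74")
print(a_won, b_won)  # → True False
```

What the logs show is the equivalent of both calls returning True, which is why I suspect the row lock is somehow not being honored (or the rows are selected outside the locked path).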
Has anyone experienced this? Do we know how this can happen?
GitHub link: https://github.com/apache/airflow/discussions/53470