1fanwang opened a new issue, #66817:
URL: https://github.com/apache/airflow/issues/66817

   ### Apache Airflow version
   
   main (development)
   
   ### What happened?
   
   `SchedulerJobRunner._do_scheduling()` runs in two phases against the same 
session:
   
   1. **Phase 1** — `_start_queued_dagruns()` then `guard.commit()` — creates 
dagrun rows and transitions queued runs to running.
   2. **Phase 2** — `_get_next_dagruns_to_examine(DagRunState.RUNNING, ...)` 
then `_schedule_all_dag_runs()` — schedules the running dagruns.
   
   After phase 1's commit, the session's identity map still holds the `DagRun` 
objects loaded by `_start_queued_dagruns()`. When phase 2 calls `flush()` / 
`merge()` during `_schedule_all_dag_runs()`, those leftover objects can be 
marked dirty again and written out by the final `guard.commit()`, which then 
takes row locks in a different order from the one other scheduler replicas 
use for their own work.
   
   Under HA scheduler deployments with multiple active replicas processing 
different DAG runs, this surfaces as A-B / B-A deadlocks on the `(dag_run, 
task_instance)` lock pair. The deadlock detector kills one transaction, the 
scheduler retries the entire `_do_scheduling()` cycle, and the loop becomes 
slow under contention.
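
To illustrate the A-B / B-A pattern, here is a minimal sketch using in-process `threading.Lock` objects as stand-ins for the two row locks. It is not Airflow code and does not touch a database: `count_lock_timeouts` is a hypothetical helper, and the acquire timeout only plays the role of the database's deadlock detector.

```python
import threading
import time


def count_lock_timeouts(order_a, order_b, hold=0.1, timeout=0.5):
    """Run two workers that each take two named locks in the given order.

    Returns how many workers gave up waiting for their second lock.
    """
    # Stand-ins for the two row locks named in this issue (assumption:
    # plain in-process locks, not real database locks).
    locks = {"dag_run": threading.Lock(), "task_instance": threading.Lock()}
    start = threading.Barrier(2)
    timeouts = []

    def worker(order):
        first, second = order
        start.wait()  # release both workers at the same moment
        with locks[first]:
            time.sleep(hold)  # hold the first lock long enough to overlap
            if locks[second].acquire(timeout=timeout):
                locks[second].release()
            else:
                # Stand-in for the transaction killed by the deadlock detector.
                timeouts.append(order)

    threads = [
        threading.Thread(target=worker, args=(order,))
        for order in (order_a, order_b)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(timeouts)


same_order = count_lock_timeouts(
    ("dag_run", "task_instance"), ("dag_run", "task_instance")
)
inverted = count_lock_timeouts(
    ("dag_run", "task_instance"), ("task_instance", "dag_run")
)
print(same_order, inverted)
```

With both workers using the same order, the second worker simply waits its turn and nobody times out. With the order inverted, each worker holds the lock the other needs, so at least one of them has to give up.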
   
   ### What you think should happen instead?
   
   Clear the identity map between phase 1 and phase 2 so phase 2 starts with a 
clean view of the world.
   
   ### How to reproduce
   
   This is hard to reproduce deterministically without multiple HA schedulers 
under contention. The symptom is `(1213, "Deadlock found when trying to get 
lock; try restarting transaction")` on MySQL, or `deadlock detected` on 
PostgreSQL, with the offending statements being `UPDATE`s against `dag_run` 
and `task_instance`.
   
   ### Proposal
   
   Add `session.expunge_all()` right after the first `guard.commit()` inside 
`_do_scheduling()`, before `_get_next_dagruns_to_examine(DagRunState.RUNNING, 
...)`. The outer `session.expunge_all()` at the end of `_do_scheduling()` 
already does the same thing globally; this one closes the window between phase 
1 and phase 2.
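
The identity-map behaviour this relies on can be shown with plain SQLAlchemy. The sketch below is not the proposed patch and does not reproduce the deadlock. It assumes SQLAlchemy 1.4 or later, an in-memory SQLite database, and a hypothetical `Run` model standing in for `DagRun`. It only shows that objects stay in the session after `commit()` and are dropped by `expunge_all()`.

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Run(Base):
    """Hypothetical stand-in for a dag_run row; not Airflow's DagRun model."""

    __tablename__ = "run"
    id = Column(Integer, primary_key=True)
    state = Column(String, nullable=False)


def identity_map_sizes():
    engine = create_engine("sqlite://")  # in-memory database
    Base.metadata.create_all(engine)
    with Session(engine) as session:
        session.add_all([Run(id=1, state="queued"), Run(id=2, state="queued")])
        session.commit()

        # "Phase 1": load the rows, move them to running, commit.
        runs = session.execute(select(Run)).scalars().all()
        for run in runs:
            run.state = "running"
        session.commit()
        # The commit expires the objects but keeps them in the identity map
        # for as long as something (here, the `runs` list) references them.
        after_commit = len(session.identity_map)

        # The step this issue proposes to add between the two phases.
        session.expunge_all()
        after_expunge = len(session.identity_map)
        detached = all(inspect(run).detached for run in runs)
    return after_commit, after_expunge, detached


print(identity_map_sizes())
```

After `expunge_all()` the phase 1 objects are detached, so a later `flush()` in the same session no longer considers them.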
   
   The change is a small scheduler patch (around 8 lines) plus a regression 
test that exercises two interleaved sessions to demonstrate the leak.
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's Code of Conduct
   

