1fanwang commented on code in PR #66820:
URL: https://github.com/apache/airflow/pull/66820#discussion_r3232722270
##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -1776,6 +1776,15 @@ def _do_scheduling(self, session: Session) -> int:
self._start_queued_dagruns(session)
guard.commit()
+ # Clear DagRun objects loaded by phase 1 from the identity map so
+ # phase 2 reloads them fresh. Otherwise stale rows can be re-dirtied
+ # by flush/merge in _schedule_all_dag_runs and committed in a row-lock
+ # order that differs from what other scheduler replicas are taking
+ # for their own work, producing A-B / B-A deadlocks on dag_run and
+ # task_instance under HA scheduler deployments. See
+ # https://github.com/apache/airflow/issues/66817.
+ session.expunge_all()
Review Comment:
Nice to meet you, Ephraim, and thanks for flagging this. I appreciate the
directness.
> I have seen a lot of PRs from you with self created issues
Yes, that's accurate.
> seems to step from guesses instead of issues you experienced
Not quite. Most of my recent issues and PRs come from direct experience
running one of the largest sets of Airflow clusters out there (based on my
discussions with folks at Airflow Summit 2025): 20-30k+ DAGs per cluster, with
very high TI concurrency. I'm raising them in a batch because we are actively
planning the Airflow 2 → 3 migration, and this is a consolidation effort
between our own 2.9.2 fork and 3.x.x.
Some of these issues have hit us in production already; others come from
defensive code review of the paths we'll lean on at cutover. The intent is to
land the fixes upstream so the community benefits too, not just us.
Beyond the PR/issue stream: I'm active on the dev list, gave a talk at Airflow
Summit 2025, have accepted talks for Airflow Summit 2026 and ApacheCon
Community Over Code Glasgow 2026, and have an AIP-96 + AIP-97 refresh heading
to the list shortly. I'm aiming for sustained engagement and contribution with
the community. Hopefully that context helps :)
On the technical analysis itself: **some of the raw internal logs and traces
can't be copy-pasted out due to company policy**, but the pattern I've been
using for OSS work is to repro the issue end-to-end against the OSS code,
capture before/after evidence, and share the result here. That's what
we've already done together on several other PR threads (e.g., the
deterministic FAILED → PASSED snippet in this PR body's regression test), and
I'm happy to keep doing it so the OSS community has full context.
I'll follow up with what you asked for. So that I send it in the right shape,
would the most useful form be:
- A sanitized scheduler log with the `1213 "Deadlock found"` /
`deadlock detected` traces against `dag_run` / `task_instance` UPDATEs?
- A SQLAlchemy event-listener capture of the phase-2 commit set?
- A `SHOW ENGINE INNODB STATUS` snapshot from a deadlock incident?
- Or something else entirely?
Whichever form you prefer, I'll put it together and follow up here.
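To make the A-B / B-A ordering concern from the code comment concrete, here is a minimal, standalone sketch. It is not Airflow code and does not touch a database: `lock_order_inversions` and the row keys are hypothetical names used only for illustration. It assumes each transaction's row locks can be written down as an ordered list of unique keys, and it reports the pairs that two transactions would take in opposite orders.

```python
from itertools import combinations


def lock_order_inversions(order_a, order_b):
    """Return row pairs that two transactions lock in opposite orders.

    Each argument is the ordered sequence of (unique) row keys a transaction
    locks. A pair (x, y) is reported when A locks x before y while B locks
    y before x, i.e. the A-B / B-A pattern that can deadlock.
    """
    position_in_b = {row: i for i, row in enumerate(order_b)}
    inversions = []
    # combinations() preserves input order, so x is locked before y in A.
    for x, y in combinations(order_a, 2):
        if x in position_in_b and y in position_in_b:
            if position_in_b[y] < position_in_b[x]:
                inversions.append((x, y))
    return inversions


# Hypothetical row keys, for illustration only.
replica_1 = [("dag_run", 7), ("task_instance", 42)]
replica_2 = [("task_instance", 42), ("dag_run", 7)]

# Opposite orders: one inverted pair is reported.
print(lock_order_inversions(replica_1, replica_2))

# Both sides locking in the same (sorted) order: nothing is reported.
print(lock_order_inversions(replica_1, sorted(replica_2)))
```

An empty result only means this toy model found no inverted pair between the two lists; it says nothing about what a real database will do, which is why the captured evidence listed above is still the useful artifact.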