1fanwang commented on code in PR #66820:
URL: https://github.com/apache/airflow/pull/66820#discussion_r3232722270


##########
airflow-core/src/airflow/jobs/scheduler_job_runner.py:
##########
@@ -1776,6 +1776,15 @@ def _do_scheduling(self, session: Session) -> int:
             self._start_queued_dagruns(session)
             guard.commit()
 
+            # Clear DagRun objects loaded by phase 1 from the identity map so
+            # phase 2 reloads them fresh. Otherwise stale rows can be re-dirtied
+            # by flush/merge in _schedule_all_dag_runs and committed in a row-lock
+            # order that differs from what other scheduler replicas are taking
+            # for their own work, producing A-B / B-A deadlocks on dag_run and
+            # task_instance under HA scheduler deployments. See
+            # https://github.com/apache/airflow/issues/66817.
+            session.expunge_all()

Review Comment:
   Nice to meet you Ephraim, and thanks for flagging — appreciate the 
directness.
   
   > I have seen a lot of PRs from you with self created issues
   
   Yes, that's accurate.
   
   > seems to step from guesses instead of issues you experienced
   
   Not quite. Most of my recent issues/PRs come from direct experience running 
one of the largest sets of Airflow clusters out there (based on my discussions 
with folks at Airflow Summit 2025). I'm raising a bunch in a batch because we 
are actively planning the Airflow 2 → 3 migration, and this is a consolidation 
effort between our own 2.9.2 fork and 3.x.x.
   
   These issues and PRs come from running production Airflow at extremely large 
scale (20-30k+ DAGs per cluster, very high TI concurrency) plus actively 
planning the Airflow 2 → 3 migration. Some have hit us in production already; 
others come from defensive code review of the paths we'll lean on at cutover. 
The intent is to land the fixes upstream so the community benefits too, not 
just us.
   
   Beyond the PR/issue stream: I'm active on the dev list, gave a talk at 
Airflow Summit 2025, have accepted talks for Airflow Summit 2026 and ApacheCon 
Community Over Code Glasgow 2026, and an AIP-96 + AIP-97 refresh is heading to 
the list shortly. I'm aiming for sustained engagement and contribution with the 
community. Hopefully that context helps :)
   
   On the technical analysis itself: **some of the raw internal logs and traces 
can't be copy-pasted out due to company policy**, but the pattern I've been 
using for OSS is to repro the issue end-to-end against the OSS code, capture 
before/after evidence, and share the result here. That's what we've already 
done together on several other PR threads (e.g., the deterministic 
FAILED → PASSED snippet in this PR body's regression test), and I'm happy to 
continue doing it so the OSS community has full context.
   
   I'll follow up with what you asked for. To make sure I send it in the right 
shape, which of these would be most useful:
   
   - A sanitized scheduler log with the `1213 "Deadlock found"` / `deadlock 
detected` traces against `dag_run` / `task_instance` UPDATEs?
   - A SQLAlchemy event-listener capture of the phase-2 commit set?
   - A `SHOW ENGINE INNODB STATUS` snapshot from a deadlock incident?
   - Or something else entirely?
   
   Whichever form you prefer, I'll put it together and follow up here.
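   
   Whichever capture is used, the property it needs to expose is whether two 
transactions locked the same rows in opposite relative order, which is the 
A-B / B-A pattern described in the code comment. Below is a minimal sketch of 
that check in plain Python. The function name `opposite_order_pairs`, the 
`(table, id)` row keys, and the two sample captures are all hypothetical and 
are not taken from Airflow or from any real incident.

```python
from itertools import combinations


def opposite_order_pairs(order_a, order_b):
    """Return row pairs that two transactions lock in opposite relative order.

    order_a / order_b are lists of hashable row keys, in the order each
    transaction locked them (a hypothetical capture format).
    """
    # Position of each row in the second transaction's lock sequence.
    pos_b = {row: i for i, row in enumerate(order_b)}
    conflicts = []
    # combinations() yields (first, second) with first locked before second in A.
    for first, second in combinations(order_a, 2):
        # Only rows touched by both transactions can form a lock cycle.
        if first in pos_b and second in pos_b and pos_b[first] > pos_b[second]:
            conflicts.append((first, second))
    return conflicts


# Hypothetical captures from two scheduler replicas (illustrative data only).
replica_1 = [("dag_run", 7), ("task_instance", 42)]
replica_2 = [("task_instance", 42), ("dag_run", 7)]

# Opposite order on the shared rows: one conflicting pair is reported.
print(opposite_order_pairs(replica_1, replica_2))

# Same rows locked in a consistent (sorted) order: no conflicting pair.
print(opposite_order_pairs(replica_1, sorted(replica_2)))
```

An opposite-order pair is a precondition for a two-transaction lock cycle, not 
proof that a deadlock occurred, so a check like this would only complement the 
database's own deadlock report.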



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.