1fanwang commented on PR #66820:
URL: https://github.com/apache/airflow/pull/66820#issuecomment-4688055760

Closing this: I couldn't reproduce the root cause this PR assumes. I set up a live multi-scheduler repro on `main` rather than reasoning about it from synthesis, and the `dag_run` / `task_instance` deadlock this targets didn't surface. The fetches in both phases already take their rows with `FOR UPDATE SKIP LOCKED`, so the schedulers don't contend on the same dag-run rows, and the before/after SQL is identical with vs without `expunge_all()`.

<details><summary>Live repro</summary>

```
# apache/airflow main, fresh MySQL 8.0 metadata DB (real `airflow db migrate`)
# 24 catchup DAGs (1-min schedule, fan-out/fan-in tasks), dags unpaused
# 2–3 real `airflow scheduler` processes against the same DB, concurrently
# after ~3 min under load: ~2,000 dag runs + ~10,000 task instances created
# → the dag_run / task_instance deadlock this PR targets: did not occur
# → before/after SQL (with vs without expunge_all between phases): identical
# → no functional difference: same dag-run/TI progression, no duplicates
```

</details>

Rather than keep up a fix that I can't back with a real repro, I'll take a step back and benchmark this properly before pursuing it again.

@ephraimbuddy, you were right to push for a real log on this; thanks for that. @potiuk, thanks for the triage.
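
For readers unfamiliar with why `FOR UPDATE SKIP LOCKED` keeps the schedulers off each other's rows, here is a minimal in-memory sketch of the semantics. It is not Airflow's scheduler code and does not touch a database: `Row`, `fetch_skip_locked`, and `release` are hypothetical names, and a `threading.Lock` per row only stands in for a row-level lock held until the transaction ends.

```python
import threading


class Row:
    """Stand-in for a dag_run row; the lock models a row-level FOR UPDATE lock."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.lock = threading.Lock()


def fetch_skip_locked(rows, limit):
    """Claim up to `limit` rows, skipping any row that is already locked."""
    claimed = []
    for row in rows:
        if len(claimed) == limit:
            break
        # Non-blocking acquire: a locked row is skipped, never waited on,
        # so two fetchers cannot end up waiting on each other's rows.
        if row.lock.acquire(blocking=False):
            claimed.append(row)
    return claimed


def release(claimed):
    """Model the end of a transaction: every row lock it held is dropped."""
    for row in claimed:
        row.lock.release()


rows = [Row(i) for i in range(6)]

# Two "schedulers" fetch while both transactions are still open.
scheduler_a = fetch_skip_locked(rows, limit=3)
scheduler_b = fetch_skip_locked(rows, limit=3)

ids_a = [r.run_id for r in scheduler_a]
ids_b = [r.run_id for r in scheduler_b]
print(ids_a, ids_b)  # [0, 1, 2] [3, 4, 5]

release(scheduler_a)
release(scheduler_b)
```

The second fetch never blocks on rows the first one holds; it simply moves on to the next unlocked rows, which is the disjoint-claim behaviour described above.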
