dstaple commented on issue #23361: URL: https://github.com/apache/airflow/issues/23361#issuecomment-1194377202
> it locks the TaskInstance and DagRun in REVERSE ORDER comparing to what Scheduler does. @potiuk I re-read your explanation and agree completely, the essence of the specific deadlock I did the deep-dive on above is `TaskInstance.refresh_from_db` and `DagRun.schedule_tis` both lock `task_instance` and `dag_run`, but in opposite orders, which causes a deadlock. I agree if we change `TaskInstance.refresh_from_db` to guarantee the order, that should prevent deadlocks between those two statements. Regarding @RNHTTR repro'ing a deadlock by deleting a running DAG run, it might be related, but it isn't exactly the same situation. At the minimum the deadlock forced by @RNHTTR probably involves a DELETE statement rather than an UPDATE. Indeed @argibbs reported deadlocks when deleting DAGs and they had DELETE statements for one of the queries (the other was not reported). It's possible that the "other query" in that case was the same SELECT ... FOR UPDATE from `TaskInstance.refresh_from_db` I reported above, but I don't want to get ahead of myself here. (Of course it's still worth checking if #25266 resolves the DELETE deadlocks in case we're lucky and they're the same issue.) Regarding `with_for_update(of=TaskInstance)` I previously suggested, understanding you've rejected this as a potential solution, it's worth reporting that I internally forked Airflow 2.2.5 and added this change about 10 days ago and it completely eliminated the deadlocks between `TaskInstance.refresh_from_db` and `DagRun.schedule_tis`, which were occuring every 2h or so and have not occurred since. At the minimum this confirms our understanding above. I can check your alternative proposed fix https://github.com/apache/airflow/pull/25266 in a staging environment but it may be a week or two before I can deploy more widely and conclusively report on whether or not it fixes the scheduler UPDATE deadlocking with the task instance SELECT ... FOR UPDATE above. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
