dstaple commented on issue #23361:
URL: https://github.com/apache/airflow/issues/23361#issuecomment-1194377202

   > it locks the TaskInstance and DagRun in REVERSE ORDER comparing to what 
Scheduler does.
   
   @potiuk I re-read your explanation and agree completely, the essence of the 
specific deadlock I did the deep-dive on above is 
`TaskInstance.refresh_from_db` and `DagRun.schedule_tis` both lock 
`task_instance` and `dag_run`, but in opposite orders, which causes a deadlock. 
I agree if we change `TaskInstance.refresh_from_db` to guarantee the order, 
that should prevent deadlocks between those two statements.
   
   Regarding @RNHTTR repro'ing a deadlock by deleting a running DAG run, it 
might be related, but it isn't exactly the same situation. At the minimum the 
deadlock forced by @RNHTTR probably involves a DELETE statement rather than an 
UPDATE. Indeed @argibbs reported deadlocks when deleting DAGs and they had 
DELETE statements for one of the queries (the other was not reported). It's 
possible that the "other query" in that case was the same SELECT ... FOR UPDATE 
from `TaskInstance.refresh_from_db` I reported above, but I don't want to get 
ahead of myself here. (Of course it's still worth checking if #25266 resolves 
the DELETE deadlocks in case we're lucky and they're the same issue.)
   
   Regarding `with_for_update(of=TaskInstance)` I previously suggested, 
understanding you've rejected this as a potential solution, it's worth 
reporting that I internally forked Airflow 2.2.5 and added this change about 10 
days ago and it completely eliminated the deadlocks between 
`TaskInstance.refresh_from_db` and `DagRun.schedule_tis`, which were occuring 
every 2h or so and have not occurred since. At the minimum this confirms our 
understanding above.
   
   I can check your alternative proposed fix 
https://github.com/apache/airflow/pull/25266 in a staging environment but it 
may be a week or two before I can deploy more widely and conclusively report on 
whether or not it fixes the scheduler UPDATE deadlocking with the task instance 
SELECT ... FOR UPDATE above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to