dstaple commented on issue #23361: URL: https://github.com/apache/airflow/issues/23361#issuecomment-1183499266
Regarding how the deadlock happens: two statements are involved, an UPDATE and a SELECT ... FOR UPDATE.

The UPDATE statement is here: https://github.com/apache/airflow/blob/2.2.5/airflow/models/dagrun.py#L903-L909. To execute, it needs row locks on the matching rows of both the `task_instance` and `dag_run` tables, but it blocks before acquiring them all. Before `schedule_tis` runs, the scheduler job has already locked the relevant `dag_run` rows:
https://github.com/apache/airflow/blob/2.2.5/airflow/jobs/scheduler_job.py#L884
https://github.com/apache/airflow/blob/2.2.5/airflow/models/dagrun.py#L287
so the UPDATE statement must be blocking on the row lock in the `task_instance` table.

Meanwhile, the SELECT ... FOR UPDATE statement (`select ... from task_instance join dag_run ... for update`) also needs locks on the matching rows of both tables. It acquires the `task_instance` lock first, then tries to lock the `dag_run` rows, which are already held by the transaction running the UPDATE. Each transaction holds the lock the other one needs, so a deadlock happens.

Regarding how this might be fixed: the suspicion is that `TaskInstance.refresh_from_db()` only needs to lock rows in the `task_instance` table, not the `dag_run` table. If so, the deadlocks might be worked around by replacing `with_for_update()` with `with_for_update(of=TaskInstance)` here: https://github.com/apache/airflow/blob/2.2.5/airflow/models/taskinstance.py#L736, at least for PostgreSQL and Oracle.

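A sketch of what the proposed change would emit at the SQL level, using SQLAlchemy directly (the model classes below are illustrative stand-ins, not Airflow's real models): plain `with_for_update()` locks matching rows in every joined table, while `with_for_update(of=TaskInstance)` renders `FOR UPDATE OF task_instance` on PostgreSQL and locks only that table's rows.

```python
from sqlalchemy import Column, ForeignKey, Integer, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DagRun(Base):
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)

class TaskInstance(Base):
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    run_id = Column(Integer, ForeignKey("dag_run.id"))

# Plain FOR UPDATE: locks the matching rows of *both* joined tables.
both = (
    select(TaskInstance)
    .join(DagRun, TaskInstance.run_id == DagRun.id)
    .with_for_update()
)

# FOR UPDATE OF task_instance: locks only the task_instance rows,
# leaving the dag_run rows free for the scheduler's transaction.
ti_only = (
    select(TaskInstance)
    .join(DagRun, TaskInstance.run_id == DagRun.id)
    .with_for_update(of=TaskInstance)
)

pg = postgresql.dialect()
print(both.compile(dialect=pg))     # ... FOR UPDATE
print(ti_only.compile(dialect=pg))  # ... FOR UPDATE OF task_instance
```

Note that `FOR UPDATE OF` is dialect-dependent, which is why the comment above hedges with "at least for PostgreSQL and Oracle".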