dstaple commented on issue #23361: URL: https://github.com/apache/airflow/issues/23361#issuecomment-1226445063
I have an update on the above. We are running a fork of 2.2.5 with the fix from https://github.com/apache/airflow/pull/25312 cherry-picked and are no longer seeing the specific deadlock outlined [above](https://github.com/apache/airflow/issues/23361#issuecomment-1179314412). We were seeing deadlocks roughly every 2 hours on vanilla Airflow 2.2.5, so this is a big improvement.

Other types of deadlocks are still possible, and we have observed a handful of them even with the fix to `TaskInstance.refresh_from_db()`. One example that is present in 2.2.5 and not resolved by https://github.com/apache/airflow/pull/25312 can be seen in this Postgres log:

```
2022-08-23 17:06:35.347 UTC [262090] ERROR:  deadlock detected
2022-08-23 17:06:35.347 UTC [262090] DETAIL:  Process 262090 waits for ShareLock on transaction 366414866; blocked by process 56942.
	Process 56942 waits for ExclusiveLock on tuple (1466,144) of relation 165112 of database 165033; blocked by process 56420.
	Process 56420 waits for ShareLock on transaction 366414616; blocked by process 262090.
	Process 262090: UPDATE task_instance SET state='scheduled' WHERE task_instance.dag_id = 'dummy_dag_id' AND task_instance.run_id = 'scheduled__2022-08-19T20:00:00+00:00' AND task_instance.task_id IN ('dummy_task_1', 'dummy_task_2', 'dummy_task_N')
	Process 56942: UPDATE dag_run SET last_scheduling_decision='2022-08-23T16:59:29.934171+00:00'::timestamptz WHERE dag_run.id = 132182
	Process 56420: UPDATE dag_run SET last_scheduling_decision='2022-08-23T16:59:29.934171+00:00'::timestamptz WHERE dag_run.id = 132182
2022-08-23 17:06:35.347 UTC [262090] HINT:  See server log for query details.
2022-08-23 17:06:35.347 UTC [262090] CONTEXT:  while updating tuple (56209,27) in relation "task_instance"
2022-08-23 17:06:35.347 UTC [262090] STATEMENT:  UPDATE task_instance SET state='scheduled' WHERE task_instance.dag_id = 'dummy_dag_id' AND task_instance.run_id = 'scheduled__2022-08-19T20:00:00+00:00' AND task_instance.task_id IN ('dummy_task_1', 'dummy_task_2', 'dummy_task_N')
```

I believe it is also possible to trigger deadlocks by manually (via the UI) updating the state of large numbers of tasks, by deleting DAG runs while they have many running tasks, and by other invasive operations, but I don't have logs for those at the moment.

I think it will be difficult to eliminate deadlocks completely as long as there are situations where multiple tables must be locked simultaneously and the order in which those locks are taken is not standardized across the Airflow codebase. Brainstorming along these lines, one approach that could reduce the potential for deadlocks would be to favor database access through methods like `TaskInstance.refresh_from_db()` and disfavor direct usage of the ORM elsewhere in the codebase. Then, if those central methods are systematic about taking minimal locks in a controlled order, all the downstream code benefits.

Regardless, I would like to stress that we went from seeing deadlocks every ~2 hours to seeing them rarely, and under different circumstances. So the fix to `TaskInstance.refresh_from_db()` made a huge difference.
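To illustrate the lock-ordering idea: the wait-for cycle in the log above arises because different code paths lock the same rows in opposite orders. Imposing one global acquisition order breaks any such cycle. Here is a minimal, hypothetical sketch (not Airflow code) using in-process locks as stand-ins for row locks; the helper name and table names are illustrative only:

```python
import threading

# One lock per "table" a code path might touch; stand-ins for row locks.
LOCKS = {
    "dag_run": threading.Lock(),
    "task_instance": threading.Lock(),
}

def with_ordered_locks(tables, critical_section):
    """Acquire locks for `tables` in a single canonical (sorted) order,
    run the critical section, then release in reverse order."""
    acquired = []
    try:
        for name in sorted(set(tables)):  # the global ordering: alphabetical
            LOCKS[name].acquire()
            acquired.append(name)
        return critical_section()
    finally:
        for name in reversed(acquired):
            LOCKS[name].release()

results = []

def scheduler_update():
    # Conceptually: UPDATE task_instance ..., then UPDATE dag_run ...
    with_ordered_locks(["task_instance", "dag_run"],
                       lambda: results.append("scheduler"))

def mini_scheduler_update():
    # Declared in the opposite order, but acquired in the same canonical
    # order, so the two paths can never deadlock against each other.
    with_ordered_locks(["dag_run", "task_instance"],
                       lambda: results.append("mini_scheduler"))

threads = [threading.Thread(target=scheduler_update),
           threading.Thread(target=mini_scheduler_update)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The same principle applies at the database level, e.g. ordering `SELECT ... FOR UPDATE` row locks by primary key before issuing updates, but the sketch above is only meant to show why standardizing the order removes the cycle.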
