dstaple commented on issue #23361:
URL: https://github.com/apache/airflow/issues/23361#issuecomment-1226445063

   I have an update on the above.
   
   We are running a fork of 2.2.5 with the fix from 
https://github.com/apache/airflow/pull/25312 cherry-picked and are no longer 
seeing the specific deadlock outlined 
[above](https://github.com/apache/airflow/issues/23361#issuecomment-1179314412).
 We were seeing deadlocks every ~2h when running with vanilla Airflow 2.2.5, so 
this is a big improvement.
   
   Other types of deadlocks are still possible, and we have observed a handful 
of these even with the fix to `TaskInstance.refresh_from_db()`. An example 
deadlock which is present in 2.2.5 and not resolved by 
https://github.com/apache/airflow/pull/25312 can be seen in this Postgres log:
   ```
   2022-08-23 17:06:35.347 UTC [262090] ERROR:  deadlock detected
   2022-08-23 17:06:35.347 UTC [262090] DETAIL:  Process 262090 waits for 
ShareLock on transaction 366414866; blocked by process 56942.
       Process 56942 waits for ExclusiveLock on tuple (1466,144) of relation 
165112 of database 165033; blocked by process 56420.
       Process 56420 waits for ShareLock on transaction 366414616; blocked by 
process 262090.
       Process 262090: UPDATE task_instance SET state='scheduled' WHERE 
task_instance.dag_id = 'dummy_dag_id' AND task_instance.run_id = 
'scheduled__2022-08-19T20:00:00+00:00' AND task_instance.task_id IN 
('dummy_task_1', 'dummy_task_2', 'dummy_task_N')
       Process 56942: UPDATE dag_run SET 
last_scheduling_decision='2022-08-23T16:59:29.934171+00:00'::timestamptz WHERE 
[dag_run.id](http://dag_run.id/) = 132182
       Process 56420: UPDATE dag_run SET 
last_scheduling_decision='2022-08-23T16:59:29.934171+00:00'::timestamptz WHERE 
[dag_run.id](http://dag_run.id/) = 132182
   2022-08-23 17:06:35.347 UTC [262090] HINT:  See server log for query details.
   2022-08-23 17:06:35.347 UTC [262090] CONTEXT:  while updating tuple 
(56209,27) in relation "task_instance"
   2022-08-23 17:06:35.347 UTC [262090] STATEMENT:  UPDATE task_instance SET 
state='scheduled' WHERE task_instance.dag_id = 'dummy_dag_id' AND 
task_instance.run_id = 'scheduled__2022-08-19T20:00:00+00:00' AND 
task_instance.task_id IN ('dummy_task_1', 'dummy_task_2', 'dummy_task_N')
   ```
   
   I believe it is also possible to get deadlocks when manually (via the UI) 
updating the state of large numbers of tasks, deleting DAG runs while they have 
many running tasks, and other invasive methods, but I don't have logs to show 
for that at the moment.
   
   I think it will be difficult to completely eliminate deadlocks as long as we 
have situations where multiple tables need to be simultaneously locked, and we 
don't standardize the order in which such mutual locks are applied across the 
Airflow codebase. Brainstorming along these lines, one approach that could 
reduce the potential for deadlocks would be to favor database access via 
methods like `TaskInstance.refresh_from_db()` and disfavor direct usage of the 
ORM elsewhere in the codebase. Then, if one is systematic about applying the 
minimal locks / controlling the order in which locks are applied in the methods 
where the ORM is directly used, all the downstream code will benefit.
   
   Regardless I would like to stress that we were seeing deadlocks every ~2h 
and are now seeing them rarely and under different circumstances. So the fix to 
`TaskInstance.refresh_from_db()` made a huge difference.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to