potiuk edited a comment on pull request #21362: URL: https://github.com/apache/airflow/pull/21362#issuecomment-1030853621
How I got to that: Gist here: https://gist.github.com/tulanowski/fcc8358bad3c8e5d15678639ec015d8b Query 1 (just a fragment of it) - this is scheduler trying to get the task instances to consider for scheduling: ``` Failed to execute task (MySQLdb._exceptions.OperationalError) (1213, 'Deadlock found when trying to get lock; try restarting transaction') [SQL: SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.run_id AS task_instance_run_id, task_instance.start_date AS dag_run_1.dag_hash AS dag_run_1_dag_hash FROM task_instance INNER JOIN dag_run AS dag_run_1 ON dag_run_1.dag_id = task_instance.dag_id AND dag_run_1.run_id = task_instance.run_id WHERE task_instance.dag_id = %s AND task_instance.task_id = %s AND task_instance.run_id = %s LIMIT %s FOR UPDATE] ``` This is this query (TRANSACTION 1 from the "server log"): ``` SELECT task_instance.try_number AS task_instance_try_number, task_instance.task_id AS task_instance_task_id, task_instance.dag_id AS task_instance_dag_id, task_instance.run_id AS task_instance_run_id, task_instance.start_date AS task_instance_start_date, task_instance.end_date AS task_instance_end_date, task_instance.duration AS .... ``` Query 2: This is is TaskReschedule insert which only happens (as far as I checked) when AirlfowRescheduleException is thrown during task execution (TRANSACTION 2 from the server log): ``` INSERT INTO task_reschedule (task_id, dag_id, run_id, try_number, start_date, end_date, duration, reschedule_date) VALUES ('raw_sensor.raw_input_shop.ipad_orders_from_popup_store', 'dwh_core.main', 'scheduled__2022-01-27T00:15:00+00:00', 1, '2022-01-28 02:16:08.000687', '2022-01-28 02:16:08.142040', 0.141353e0, '2022-01-28 02:17:08.129220') ``` Both of them are waiting for this lock: ``` *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 577 page no 2196 n bits 296 index dag_run_dag_id_run_id_key of table `airflow`.`dag_run` trx id 844106125 lock mode S locks rec but not gap waiting Record lock, heap no 230 PHYSICAL RECORD: n_fields 3; compact format; info bits 0 0: len 13; hex 6477685f636f72652e6d61696e; asc dwh_core.main;; 1: len 30; hex 7363686564756c65645f5f323032322d30312d32375430303a31353a3030; asc scheduled__2022-01-27T00:15:00; (total 36 bytes); 2: len 4; hex 800198fa; asc ;; ``` How I understand this - this one is a lock on index of the dag_run primary key which needs to be updated because we are inserting a row in TaskReschedule, and because of the 'DagRun" relationship in the TaskReschedule object, this one needs to be locked when TaskReschedule related to the same dag_run_id needs to be updated. So what I think happens: 1) Scheduler locks the DAGRun entity (but not dag run index ! ) 2) Task throws AirflowRescheduleException and it tries to insert 'task_reschedule' and attempts to get the index lock (but cannot get it because that also needs to get the DagRun entity lock so it waits for it) 2) The UPDATE TASK_INSTANCE and this one tries to get the index lock - but it can't because the insert already has it Classic deadlock. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
