The GitHub Actions job "Tests" on airflow.git has succeeded.
Run started by GitHub user potiuk (triggered by potiuk).

Head commit for run:
12df02bf4be0f4424072d253224efa3ffdab2c8f / Jarek Potiuk <[email protected]>
Fix Deadlock on refresh from DB by local task run

This PR attempts to fix the deadlock that occurs when a task instance
is run in parallel with the _do_scheduling operation
executing get_next_dagruns_to_examine.

The whole scheduling is based on locking the DagRuns the scheduler
operates on, which means that the state of ANY task instance
for that DagRun should not change during the scheduling.

However, there are some cases where a task instance is locked
FOR UPDATE without prior locking of the DagRun table. This
happens, for example, when the local task job executes the task
and runs the "check_and_change_state_before_execution" method on the
task instance it runs. No earlier DagRun locking happens,
and the "refresh_from_db" run with lock_for_update
takes the lock on both the TaskInstance row and the
DagRun row. The problem is that in this case the locking happens
in the reverse sequence:

1) get_next_dagruns_to_examine - locks DagRun first and THEN
   tries to lock some task instances for that DagRun

2) "check_and_change_state_before_execution" effectively runs the
    query: select ... from task_instance join dag_run ... for update
    which FIRST locks TaskInstance and then the DagRun table.

This reverse sequence of locking is what causes the deadlock.
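
The conflict between the two code paths can be illustrated with a small
check on lock acquisition orders. This is only a sketch: the
LOCK_ORDERS table and the conflicting_pairs helper are hypothetical
names made up for illustration, not part of Airflow, and the orders are
simply transcribed from the two cases above.

```python
from itertools import combinations

# Lock acquisition order per code path, transcribed from the description
# above. The dictionary and the helper below are illustrative only.
LOCK_ORDERS = {
    "get_next_dagruns_to_examine": ["dag_run", "task_instance"],
    "check_and_change_state_before_execution": ["task_instance", "dag_run"],
}


def conflicting_pairs(lock_orders):
    """Return pairs of code paths that take two shared locks in opposite order."""
    conflicts = []
    for (name_a, order_a), (name_b, order_b) in combinations(lock_orders.items(), 2):
        # Locks used by both paths, listed in the order path A acquires them.
        shared = [lock for lock in order_a if lock in order_b]
        for first, second in combinations(shared, 2):
            # Path A takes `first` before `second`; path B reversing that
            # order is the classic precondition for a deadlock.
            if order_b.index(first) > order_b.index(second):
                conflicts.append((name_a, name_b))
                break
    return conflicts


if __name__ == "__main__":
    print(conflicting_pairs(LOCK_ORDERS))
```

If both paths listed "dag_run" before "task_instance", the helper would
report no conflicting pairs.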

The fix is to force locking the DagRun before running the task instance
query that joins dag_run to task_instance.
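
The idea behind the fix, a consistent lock order, can be sketched with
plain Python threads. This is an analogy under stated assumptions, not
the actual change: the two threading.Lock objects merely stand in for
the DagRun and TaskInstance row locks, and locked_section and run_both
are hypothetical names.

```python
import threading


def run_both():
    """Run two workers that take the same two locks in the same order."""
    dag_run_lock = threading.Lock()        # stands in for the DagRun row lock
    task_instance_lock = threading.Lock()  # stands in for the TaskInstance row lock
    finished = []

    def locked_section(name):
        # Both code paths take the DagRun lock first and the TaskInstance
        # lock second, so neither can hold one lock while waiting for a
        # lock the other already holds.
        with dag_run_lock:
            with task_instance_lock:
                finished.append(name)

    threads = [
        threading.Thread(target=locked_section, args=(name,))
        for name in ("scheduler", "local_task_job")
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join(timeout=5)
    return sorted(finished)


if __name__ == "__main__":
    print(run_both())
```

Because every worker acquires the locks in the same order, both workers
always finish, regardless of how the threads are scheduled.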

Fixes: #23361

Report URL: https://github.com/apache/airflow/actions/runs/2731610635

With regards,
GitHub Actions via GitBox


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
