avolant opened a new pull request, #67246:
URL: https://github.com/apache/airflow/pull/67246
## Problem
`PATCH /execution/task-instances/{id}/state` calls:
```python
ti = session.get(TI, task_instance_id, with_for_update=True)
```
The `TI` mapper has `dag_run = relationship(..., lazy="joined")`, so this
emits:
```sql
SELECT task_instance.*, dag_run.*
FROM task_instance
JOIN dag_run ON dag_run.dag_id = task_instance.dag_id
AND dag_run.run_id = task_instance.run_id
WHERE task_instance.id = %s
FOR UPDATE
```
`FOR UPDATE` with no `OF` clause locks the selected rows in every table in
the FROM list: both `task_instance` **and** `dag_run`. Under concurrent task
completions in the same DAG run, every worker serialises on the single shared
`dag_run` row, and those transactions deadlock with the scheduler's
`TriggerRuleDep` queries (which run in transactions that also touch
`task_instance`).
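To make the cycle concrete, here is a minimal wait-for-graph sketch. It is illustrative only: the transaction names, the row keys, and the assumption that the scheduler already holds the `dag_run` row lock while waiting for a `task_instance` row are hypothetical, not taken from a captured deadlock report.

```python
# Illustrative model only: transaction names and row keys are hypothetical.

def wait_graph(held, wanted):
    """Map each blocked transaction to the transaction holding the row it wants."""
    owner = {row: txn for txn, rows in held.items() for row in rows}
    return {
        txn: owner[row]
        for txn, row in wanted.items()
        if row in owner and owner[row] != txn
    }


def find_cycle(waits_for):
    """Return the transactions forming a wait cycle, or None if there is none."""
    for start in waits_for:
        path, node = [start], waits_for.get(start)
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None


# Assumed snapshot: the worker has locked its task_instance row,
# the scheduler has locked the shared dag_run row.
held = {"worker": ["task_instance:42"], "scheduler": ["dag_run:r1"]}

# Bare FOR UPDATE: the worker also needs the dag_run row.
bare_wanted = {"worker": "dag_run:r1", "scheduler": "task_instance:42"}
# FOR UPDATE OF task_instance: the worker needs nothing further.
scoped_wanted = {"scheduler": "task_instance:42"}

bare_cycle = find_cycle(wait_graph(held, bare_wanted))
scoped_cycle = find_cycle(wait_graph(held, scoped_wanted))
print(bare_cycle, scoped_cycle)
```

In the bare case each transaction waits on the other, which is the pattern a database reports as a deadlock. In the scoped case the scheduler merely waits for the worker to commit.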
Observed in production: ~2,000 `psycopg2.errors.DeadlockDetected` +
widespread `statement_timeout` (5 s) errors per hour on the `PATCH .../state`
endpoint, with a mean execution time of 4,297 ms (pure lock-wait; the query
does zero disk I/O).
Three other call sites in this file already use `with_for_update(of=TI)` for
exactly this reason (lines 152, 339, 697). The two remaining bare
`with_for_update=True` calls are the ones hit on every normal task completion.
## Fix
Scope the lock to `task_instance` only:
```python
ti = session.get(TI, task_instance_id, with_for_update={"of": TI})
```
This produces `FOR UPDATE OF task_instance`, leaving `dag_run` unlocked and
breaking the deadlock cycle.
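The difference in the emitted lock clause can be checked without a database by compiling equivalent `select()` statements against the PostgreSQL dialect. This is a sketch under stated assumptions: `DagRun` and `TI` below are stripped-down stand-ins rather than Airflow's real models, `innerjoin=True` is inferred from the plain `JOIN` in the SQL above, and `lock_clause` is a hypothetical helper. It uses `select(...).with_for_update(...)` instead of `session.get(...)`, which takes the same flags as a dictionary.

```python
from sqlalchemy import Column, ForeignKeyConstraint, Integer, String, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class DagRun(Base):  # stand-in model, not Airflow's DagRun
    __tablename__ = "dag_run"
    dag_id = Column(String, primary_key=True)
    run_id = Column(String, primary_key=True)


class TI(Base):  # stand-in model, not Airflow's TaskInstance
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    dag_id = Column(String, nullable=False)
    run_id = Column(String, nullable=False)
    __table_args__ = (
        ForeignKeyConstraint(
            ["dag_id", "run_id"], ["dag_run.dag_id", "dag_run.run_id"]
        ),
    )
    # Joined eager load, as described for the real mapper.
    dag_run = relationship(DagRun, lazy="joined", innerjoin=True)


def lock_clause(stmt):
    """Return the trailing FOR UPDATE clause of the compiled statement."""
    sql = str(stmt.compile(dialect=postgresql.dialect()))
    return sql[sql.index("FOR UPDATE"):].strip()


# Unscoped lock: applies to every table in the FROM list.
bare = lock_clause(select(TI).where(TI.id == 1).with_for_update())
# Scoped lock: restricted to task_instance.
scoped = lock_clause(select(TI).where(TI.id == 1).with_for_update(of=TI))

print(bare)
print(scoped)
```

The scoped statement is expected to end in `FOR UPDATE OF task_instance`, while the bare one ends in a plain `FOR UPDATE`.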
## Testing
Existing unit tests cover the state-update path. There is no behaviour change:
the function still eager-loads `dag_run` through the joined relationship; it
just no longer holds a row lock on it.