1fanwang opened a new pull request, #66783:
URL: https://github.com/apache/airflow/pull/66783
Under high sensor concurrency, the execution API's reschedule write contends
on the TaskInstance row lock. With MySQL's default `innodb_lock_wait_timeout`
of 50 s, a blocked worker holds its DB connection for up to 50 s
before raising `OperationalError(1205)`, long enough for blocked connections to
pile up in the connection pool and cascade into 5xx responses for the rest of
the workload.
The reschedule write is also not wrapped in retry logic, so a single transient
`1213: Deadlock found` or `1205: Lock wait timeout exceeded` fails the sensor
task — even though the operation is idempotent and Airflow already has
`retry_db_transaction` for exactly this pattern (used on `DagRun`,
`RenderedTaskInstanceFields`, `DagWarning`, etc.).
### Fix
Route the `UP_FOR_RESCHEDULE` branch of `PATCH /task-instances/{id}/state`
through a new helper, `_commit_reschedule_state`, decorated with
`@retry_db_transaction(retries=10)`. The helper runs the `SELECT ... FOR
UPDATE` + `INSERT TaskReschedule` + `UPDATE TaskInstance` sequence inside
`_short_lock_wait_timeout(session)`, a context manager that — on MySQL —
temporarily lowers `innodb_lock_wait_timeout` for the duration of the
lock-and-write and restores the previous value before the connection returns to
the pool. On Postgres and SQLite the context manager is a no-op; those backends
handle this contention differently.
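
To illustrate the set-and-restore behaviour described above, here is a minimal, stdlib-only sketch. It is not the code in this PR: the function signature, the `execute` callable, and the `make_fake_execute` helper are hypothetical stand-ins for a real session, and the default of `4` simply mirrors the new setting's default.

```python
from contextlib import contextmanager


@contextmanager
def short_lock_wait_timeout(execute, dialect_name, timeout_seconds=4):
    """Temporarily lower innodb_lock_wait_timeout on MySQL; no-op elsewhere.

    `execute` is a hypothetical callable that runs one SQL statement and
    returns a scalar (or None). It stands in for a real DB session.
    """
    if dialect_name != "mysql":
        # Non-MySQL backends: emit no SQL at all.
        yield
        return
    previous = int(execute("SELECT @@SESSION.innodb_lock_wait_timeout"))
    execute(f"SET SESSION innodb_lock_wait_timeout = {int(timeout_seconds)}")
    try:
        yield
    finally:
        # Restore the old value even if the write failed, so the pooled
        # connection does not keep the short timeout.
        execute(f"SET SESSION innodb_lock_wait_timeout = {previous}")


def make_fake_execute(current_timeout=50):
    """Hypothetical recorder used only to show which statements are issued."""
    statements = []

    def execute(sql):
        statements.append(sql)
        return current_timeout if sql.startswith("SELECT") else None

    return execute, statements


mysql_execute, mysql_statements = make_fake_execute()
with short_lock_wait_timeout(mysql_execute, "mysql"):
    mysql_statements.append("-- lock-and-write happens here")

sqlite_execute, sqlite_statements = make_fake_execute()
with short_lock_wait_timeout(sqlite_execute, "sqlite"):
    pass
```

In this sketch the MySQL path records a read of the current value, a `SET` to the short timeout, the write, and a `SET` back to the original value, while the SQLite path records nothing.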
Together: blocked writes either succeed quickly, deadlock-retry through the
decorator, or fail fast within a few seconds — never block a worker for 50 s.
The short timeout is configurable via the new `[scheduler]
reschedule_lock_timeout_seconds` setting (default `4`).
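
The retry half of the fix can be sketched in the same spirit. This is an illustrative stand-in, not Airflow's `retry_db_transaction`: the `TransientDBError` class, the `retry_transient` decorator, and the zero default delay are all assumptions made for the example. A real implementation would also need to roll back the session between attempts.

```python
import functools
import time

# MySQL error codes named in the description: lock wait timeout and deadlock.
RETRYABLE_CODES = {1205, 1213}


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver-level OperationalError."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def retry_transient(retries=10, base_delay=0.0):
    """Retry the wrapped callable on retryable error codes (illustrative only)."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except TransientDBError as exc:
                    # Give up on non-retryable codes or on the last attempt.
                    if exc.code not in RETRYABLE_CODES or attempt == retries:
                        raise
                    time.sleep(base_delay * attempt)

        return wrapper

    return decorator


attempts = []


@retry_transient(retries=10)
def commit_reschedule_state():
    """Hypothetical write that hits a deadlock once, then succeeds."""
    attempts.append(1)
    if len(attempts) == 1:
        raise TransientDBError(1213)
    return "up_for_reschedule"


result = commit_reschedule_state()
```

Because the reschedule write is idempotent, re-running the whole function after a deadlock is safe. Here the first attempt fails with `1213` and the second returns normally.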
### Reproducer
A DAG with 50 sensor tasks, each with `mode="reschedule"` and `poke_interval=10` (seconds),
all targeting the same DAG run. On a MySQL backend with 8+ schedulers/workers,
running this for ~15 minutes reproduces sporadic `OperationalError: (1205,
'Lock wait timeout exceeded; try restarting transaction')` task failures
alongside `(1213, 'Deadlock found')` errors that escape the route handler.
### Tests
- `test_ti_update_state_reschedule_retries_transient_deadlock` — patches the
reschedule write to raise `DBAPIError` on the first call, asserts the retry
decorator re-attempts and the sensor task ends up in `UP_FOR_RESCHEDULE` (not
`FAILED`).
- `test_short_lock_wait_timeout_noop_on_non_mysql` — verifies the context
manager emits no SQL on SQLite/Postgres.
- `test_short_lock_wait_timeout_sets_and_restores_on_mysql` — verifies the
MySQL path issues `SELECT @@SESSION.innodb_lock_wait_timeout`, sets the short
timeout, and restores the original value on exit.
Closes #66778