1fanwang opened a new pull request, #66783:
URL: https://github.com/apache/airflow/pull/66783

   Under high sensor concurrency, the execution API's reschedule write contends 
on the TaskInstance row lock. With MySQL's default `innodb_lock_wait_timeout` 
of 50 s, a blocked worker keeps its DB connection idle for nearly a minute 
before raising `OperationalError(1205)`. That is long enough for blocked 
connections to pile up in the connection pool and cascade into 5xx responses 
for the rest of the workload. 
The reschedule write is also not wrapped in retry logic, so a single transient 
`1213: Deadlock found` or `1205: Lock wait timeout exceeded` fails the sensor 
task — even though the operation is idempotent and Airflow already has 
`retry_db_transaction` for exactly this pattern (used on `DagRun`, 
`RenderedTaskInstanceFields`, `DagWarning`, etc.).
   
   ### Fix
   
   Route the `UP_FOR_RESCHEDULE` branch of `PATCH /task-instances/{id}/state` 
through a new helper, `_commit_reschedule_state`, decorated with 
`@retry_db_transaction(retries=10)`. The helper runs the `SELECT ... FOR 
UPDATE` + `INSERT TaskReschedule` + `UPDATE TaskInstance` sequence inside 
`_short_lock_wait_timeout(session)`, a context manager that — on MySQL — 
temporarily lowers `innodb_lock_wait_timeout` for the duration of the 
lock-and-write and restores the previous value before the connection returns to 
the pool. On Postgres and SQLite the context manager is a no-op; those backends 
handle this contention differently.
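
A minimal sketch of the lower-then-restore pattern described above, not the code in this PR. `FakeConnection` and `FakeResult` are hypothetical stand-ins that record SQL instead of talking to MySQL, and the function signature (`conn`, `dialect_name`, `timeout_seconds`) is an assumption made for illustration; the real helper takes a `session`.

```python
from contextlib import contextmanager


class FakeResult:
    """Hypothetical result object exposing .scalar()."""

    def __init__(self, value=None):
        self._value = value

    def scalar(self):
        return self._value


class FakeConnection:
    """Hypothetical connection that records SQL instead of executing it."""

    def __init__(self, current_timeout=50):
        self.current_timeout = current_timeout
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)
        if sql.startswith("SELECT @@SESSION.innodb_lock_wait_timeout"):
            return FakeResult(self.current_timeout)
        if sql.startswith("SET SESSION innodb_lock_wait_timeout"):
            self.current_timeout = int(sql.rsplit("=", 1)[1])
        return FakeResult()


@contextmanager
def short_lock_wait_timeout(conn, dialect_name, timeout_seconds=4):
    """Lower the MySQL session lock wait timeout, then restore it on exit."""
    if dialect_name != "mysql":
        # Other backends: emit no SQL at all.
        yield
        return
    previous = conn.execute("SELECT @@SESSION.innodb_lock_wait_timeout").scalar()
    conn.execute(f"SET SESSION innodb_lock_wait_timeout = {int(timeout_seconds)}")
    try:
        yield  # the lock-and-write sequence would run here
    finally:
        # Restore even if the body raised, so the pooled connection is clean.
        conn.execute(f"SET SESSION innodb_lock_wait_timeout = {int(previous)}")


conn = FakeConnection(current_timeout=50)
with short_lock_wait_timeout(conn, "mysql", timeout_seconds=4):
    inside = conn.current_timeout  # value seen by the lock-and-write body
```

The `finally` clause is the important part of the design: the session variable outlives the transaction, so a connection returned to the pool without the restore would keep the short timeout for unrelated queries.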
   
   Together, a contended write now succeeds quickly, is retried by the decorator 
after a deadlock, or fails fast within a few seconds; it never blocks a worker for 50 s.
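
The retry half of this behaviour can be illustrated with a simplified stand-in. This is not Airflow's `retry_db_transaction`; `TransientDBError`, `retry_transient`, and the reading of `retries` as the total number of attempts are assumptions made for the sketch.

```python
import functools


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver error carrying a MySQL error code."""

    def __init__(self, code, message):
        super().__init__(code, message)
        self.code = code


# 1205: lock wait timeout exceeded, 1213: deadlock found
RETRYABLE_CODES = {1205, 1213}


def retry_transient(retries=10):
    """Re-run the wrapped function on retryable errors, up to `retries` attempts."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except TransientDBError as exc:
                    # Give up on non-retryable codes or on the last attempt.
                    if exc.code not in RETRYABLE_CODES or attempt == retries:
                        raise
                    # Real code would roll back the session before retrying.

        return wrapper

    return decorator


calls = {"n": 0}


@retry_transient(retries=10)
def commit_reschedule_state():
    """Fails once with a deadlock, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise TransientDBError(1213, "Deadlock found")
    return "up_for_reschedule"


result = commit_reschedule_state()
```

Retrying is only safe here because, as noted above, the reschedule write is idempotent; the same wrapper around a non-idempotent write could apply it twice.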
   
   The short timeout is configurable via the new `[scheduler] 
reschedule_lock_timeout_seconds` setting (default `4`).
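
As a rough illustration of overriding the setting, the snippet below parses an `airflow.cfg`-style fragment with the standard library and falls back to the default of `4` when the key is absent. The value `2` and the `read_lock_timeout` helper are hypothetical; Airflow reads its configuration through its own `conf` object, not through this code.

```python
import configparser

# Hypothetical airflow.cfg fragment overriding the default of 4 seconds.
AIRFLOW_CFG_FRAGMENT = """
[scheduler]
reschedule_lock_timeout_seconds = 2
"""


def read_lock_timeout(cfg_text, default=4):
    """Return the configured timeout, or `default` if the key or section is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getint(
        "scheduler", "reschedule_lock_timeout_seconds", fallback=default
    )


configured = read_lock_timeout(AIRFLOW_CFG_FRAGMENT)
unconfigured = read_lock_timeout("")
```

By Airflow's usual environment-variable convention, the same setting would presumably also be reachable as `AIRFLOW__SCHEDULER__RESCHEDULE_LOCK_TIMEOUT_SECONDS`.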
   
   ### Reproducer
   
   A DAG with 50 sensor tasks, each with `mode="reschedule"` and a 10 s `poke_interval`, 
all targeting the same DAG run. On a MySQL backend with 8+ schedulers/workers, 
running this for ~15 minutes reproduces sporadic `OperationalError: (1205, 
'Lock wait timeout exceeded; try restarting transaction')` task failures 
alongside `(1213, 'Deadlock found')` retries that escape the route handler.
   
   ### Tests
   
   - `test_ti_update_state_reschedule_retries_transient_deadlock`: patches the 
reschedule write to raise `DBAPIError` on the first call, then asserts that the 
retry decorator re-attempts and the sensor task ends up in `UP_FOR_RESCHEDULE` 
(not `FAILED`).
   - `test_short_lock_wait_timeout_noop_on_non_mysql`: verifies that the context 
manager emits no SQL on SQLite/Postgres.
   - `test_short_lock_wait_timeout_sets_and_restores_on_mysql`: verifies that the 
MySQL path issues `SELECT @@SESSION.innodb_lock_wait_timeout`, sets the short 
timeout, and restores the original value on exit.
   
   Closes #66778

