kalluripradeep opened a new pull request, #64503:
URL: https://github.com/apache/airflow/pull/64503
When LocalExecutor runs with high parallelism, a race condition can occur:
a task instance is completed/deleted between the time
`_check_for_removed_or_restored_tasks` loads TIs into the session and
the time `session.flush()` is called inside `_create_task_instances`.
The flush then raises a `StaleDataError` (SQLAlchemy ORM optimistic
locking violation), which was previously uncaught and crashed the
scheduler entirely instead of letting it recover gracefully.
The key reason it slipped through: `StaleDataError` is **not** a
subclass of `DBAPIError` (and therefore not of `IntegrityError`), so it
bypassed both the `except IntegrityError` guard in
`_create_task_instances` **and** the tenacity retry wrapper in
`run_with_db_retries`.
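
To make the hierarchy argument concrete, here is a minimal sketch using stand-in exception classes rather than imports from SQLAlchemy. The class layout is simplified (the real `IntegrityError` reaches `DBAPIError` through intermediate classes), and `flush_guarded`, `raise_stale`, and `raise_integrity` are hypothetical helpers, not Airflow code.

```python
# Stand-in classes mirroring the relevant part of SQLAlchemy's exception
# hierarchy. Illustrative only; intermediate classes are omitted.
class SQLAlchemyError(Exception): ...
class DBAPIError(SQLAlchemyError): ...
class IntegrityError(DBAPIError): ...
class StaleDataError(SQLAlchemyError): ...  # sibling branch, not a DBAPIError


def flush_guarded(flush):
    """Call flush() behind a guard that only knows about IntegrityError."""
    try:
        flush()
    except IntegrityError:
        return "handled"
    return "ok"


def raise_integrity():
    raise IntegrityError("duplicate key")


def raise_stale():
    raise StaleDataError("expected to update 1 row(s); 0 were matched")
```

Calling `flush_guarded(raise_integrity)` returns `"handled"`, while `flush_guarded(raise_stale)` lets the exception propagate to the caller, which is the pre-fix behavior described above.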
**Changes:**
- Catch `StaleDataError` alongside `IntegrityError` in
`_create_task_instances` and roll back the session safely
- Add `StaleDataError` to the tenacity retry list in
`run_with_db_retries` so the scheduling loop retries the transient
race condition
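
The shape of the two changes can be sketched roughly as follows. This is not the actual diff: the exception classes are stand-ins, `create_task_instances` and its boolean return value are hypothetical, and `run_with_retries` is a hand-rolled loop standing in for the tenacity-based wrapper.

```python
# Stand-in exception classes (not imported from SQLAlchemy).
class SQLAlchemyError(Exception): ...
class DBAPIError(SQLAlchemyError): ...
class IntegrityError(DBAPIError): ...
class StaleDataError(SQLAlchemyError): ...


def create_task_instances(session):
    """Flush pending rows; roll back instead of crashing on a known race."""
    try:
        session.flush()
    except (IntegrityError, StaleDataError):
        # Another process changed or removed the row first: discard our
        # pending state and let the next scheduling loop re-evaluate.
        session.rollback()
        return False
    return True


def run_with_retries(func, retries=3, retry_on=(DBAPIError, StaleDataError)):
    """Retry func() on transient DB errors, re-raising on the last attempt."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == retries:
                raise
```

Listing `StaleDataError` explicitly in `retry_on` is the point of the second change: because it sits outside the `DBAPIError` branch, a retry filter keyed on `DBAPIError` alone would never match it.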
**Tests added:**
- `test_verify_integrity_handles_stale_data_error` — verifies
`StaleDataError` during `session.flush()` is caught and
`session.rollback()` is called
- `test_retry_db_transaction_with_stale_data_error` — verifies
`StaleDataError` is retried 3 times by `run_with_db_retries`
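
For readers who want a feel for how such tests can be structured, here is a self-contained sketch using `unittest.mock`. It does not reproduce the tests in this PR: `flush_or_rollback` and `retry` are hypothetical stand-ins for the Airflow code under test, and `StaleDataError` is a local placeholder class.

```python
from unittest import mock


class StaleDataError(Exception):
    """Local placeholder for sqlalchemy.orm.exc.StaleDataError."""


def flush_or_rollback(session):
    # Hypothetical code under test: roll back when the flush hits the race.
    try:
        session.flush()
    except StaleDataError:
        session.rollback()


def retry(func, retries=3):
    # Hypothetical retry helper: re-raise only after the final attempt.
    for attempt in range(1, retries + 1):
        try:
            return func()
        except StaleDataError:
            if attempt == retries:
                raise


def check_rollback_called():
    session = mock.MagicMock()
    session.flush.side_effect = StaleDataError("row vanished")
    flush_or_rollback(session)
    session.rollback.assert_called_once_with()


def count_attempts():
    operation = mock.MagicMock(side_effect=StaleDataError("row vanished"))
    try:
        retry(operation)
    except StaleDataError:
        pass  # expected once all attempts are exhausted
    return operation.call_count
```

Setting `side_effect` on the mocked `flush` forces the error path deterministically, so the race does not have to be reproduced with real concurrency.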
Fixes #63926