[PR] fix: swallow OperationalError in TriggerRunnerSupervisor.clean_unused() to prevent triggerer CLBO [airflow]

via GitHub Mon, 08 Jun 2026 06:07:35 -0700


safaehar opened a new pull request, #68227:
URL: https://github.com/apache/airflow/pull/68227


   ## Motivation
   
   `TriggerRunnerSupervisor.clean_unused()` calls `Trigger.clean_unused()`, 
which executes a `DELETE FROM trigger WHERE ...` that joins against 
`task_instance`. Under row-level lock contention (specifically, when the 
triggerer's own `SELECT ... FOR UPDATE` queries hold locks on trigger rows 
while async coroutine work is in progress), the DELETE blocks waiting for those 
locks. If the wait exceeds the database `statement_timeout`, PostgreSQL cancels 
the statement (SQLSTATE `57014`, `query_canceled`). The database driver raises 
this as `QueryCanceled`, which SQLAlchemy surfaces as `OperationalError`.
   
   This exception propagates unhandled up through 
`TriggerRunnerSupervisor.run()` and crashes the triggerer process. The repeated 
restarts put the triggerer pod into CrashLoopBackOff.
   
   This was observed in production (PostgreSQL metaDB, Airflow 3.2.1) across 
multiple workergroup deployments with 20–32 triggerer restarts over 3 days.
   
   ## Changes
   
   - Wrap the `Trigger.clean_unused()` call inside 
`TriggerRunnerSupervisor.clean_unused()` in a `try`/`except` that catches 
`OperationalError` and logs a warning instead of propagating the exception
   - Add `sqlalchemy.exc` to the existing `sqlalchemy` import
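
   The pattern described in the list above can be sketched as follows. This is a minimal illustration, not the actual diff: `clean_unused_best_effort` and `TransientDBError` are hypothetical names, and `TransientDBError` stands in for `sqlalchemy.exc.OperationalError` so the snippet runs without SQLAlchemy installed.

```python
from __future__ import annotations

import logging
from typing import Callable

log = logging.getLogger(__name__)


class TransientDBError(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.OperationalError."""


def clean_unused_best_effort(
    cleanup: Callable[[], None],
    transient: type[BaseException] = TransientDBError,
) -> bool:
    """Run a housekeeping callable, logging (not raising) transient DB errors.

    Returns True if the cleanup ran, False if it was skipped for this cycle.
    Any exception that is not an instance of `transient` still propagates.
    """
    try:
        cleanup()
    except transient:
        # Skip this cycle; the caller is expected to retry on its next heartbeat.
        log.warning("Trigger cleanup failed; will retry on next heartbeat", exc_info=True)
        return False
    return True
```

   Only the named exception type is swallowed, so programming errors and other unexpected failures in the cleanup path still surface.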
   
   ## Why this is safe
   
   `clean_unused()` is best-effort periodic housekeeping. Orphaned trigger rows 
sitting in the database for one extra heartbeat cycle (~1s) have no functional 
impact — triggers still fire, deferrable tasks still run. The cleanup retries 
on the next heartbeat. Crashing the triggerer over a transient DB error is 
strictly worse than skipping one cleanup cycle.
   
   ## Alternatives considered
   
   - Fixing the lock contention directly: the triggerer's `SELECT ... FOR 
UPDATE` pattern is intentional for claiming triggers. Reducing contention via 
`idle_in_transaction_session_timeout` on the DB side helps but doesn't 
eliminate the race window.
   - Re-raising after N consecutive failures: adds complexity for limited 
benefit — a persistent DB outage would surface through other signals (heartbeat 
failures, liveness probes) before the retry count mattered.
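
   For comparison, the rejected "re-raise after N consecutive failures" alternative could look roughly like the sketch below. `FailureBudget` and `TransientDBError` are hypothetical names introduced for illustration (the latter standing in for `sqlalchemy.exc.OperationalError`); nothing like this is part of the PR.

```python
from __future__ import annotations

from typing import Callable


class TransientDBError(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.OperationalError."""


class FailureBudget:
    """Swallow transient errors until `max_consecutive` failures occur in a row."""

    def __init__(self, max_consecutive: int) -> None:
        self.max_consecutive = max_consecutive
        self.failures = 0

    def run(self, cleanup: Callable[[], None]) -> bool:
        try:
            cleanup()
        except TransientDBError:
            self.failures += 1
            if self.failures >= self.max_consecutive:
                # Budget exhausted: let the error propagate to the caller.
                raise
            return False
        # Any success resets the streak.
        self.failures = 0
        return True
```

   The extra state (a counter that must be reset on success and carried across heartbeats) is the added complexity the bullet above refers to.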


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

