GitHub user hezeclark added a comment to the discussion: Race condition in 
CeleryExecutor with multiple schedulers causing duplicate TaskInstance execution

Duplicate TaskInstance execution with multiple schedulers and CeleryExecutor is 
a known race condition related to the scheduler heartbeat and task claiming 
mechanism.

The root causes:

1. Multiple schedulers racing to queue the same TaskInstance — the 
database-level locking in Airflow 3.x uses SELECT ... FOR UPDATE SKIP LOCKED, 
but narrow race windows can remain under high load (for example, between 
claiming a TaskInstance and committing the transaction).

2. Celery's at-least-once delivery (task_acks_late): the message is 
acknowledged only after the task finishes, so if the ack arrives after the 
broker's visibility timeout expires, the broker redelivers the same task to 
another worker.
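The redelivery failure mode can be sketched in a few lines of plain Python. 
This is illustrative only: the message shape and the dedup store are 
assumptions for the sketch, not Airflow internals. The point is that an 
at-least-once broker will run the same TaskInstance twice unless execution 
checks an idempotency key first.

```python
executed = []       # side effects actually performed
seen_keys = set()   # dedup store keyed by the TaskInstance identity

def deliver(message):
    # At-least-once delivery: the broker may hand the same message to a
    # worker twice (e.g. the ack arrived after the visibility timeout).
    key = (message["dag_id"], message["task_id"], message["run_id"])
    if key in seen_keys:        # idempotency guard: skip duplicate work
        return "skipped-duplicate"
    seen_keys.add(key)
    executed.append(key)        # the "real" work happens exactly once
    return "executed"

msg = {"dag_id": "etl", "task_id": "load", "run_id": "2024-01-01"}
first = deliver(msg)    # original delivery
second = deliver(msg)   # redelivery of the same unacked message
```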

Mitigations:

1. Enable scheduler-level deduplication — make sure all schedulers share the 
same metadata database and scheduler_heartbeat_sec is not set too low:

[scheduler]
scheduler_heartbeat_sec = 5  # default; increase under load
max_tis_per_query = 512  # reduce to lower contention

2. Use database-level row locks — Airflow coordinates schedulers with 
SELECT ... FOR UPDATE row-level locking; make sure it has not been disabled:

[scheduler]
use_row_level_locking = True  # default

3. For Celery + Redis, set broker_transport_options correctly:

[celery_broker_transport_options]
visibility_timeout = 86400  # seconds; must exceed your longest task's runtime

4. Consider switching to the Database Executor (Airflow 3.x) for 
multi-scheduler setups — it relies on database-level SKIP LOCKED, which is 
more reliable for deduplication than Celery's broker redelivery semantics.
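For intuition, the SKIP LOCKED claim pattern that items 2 and 4 lean on can 
be mimicked with in-process locks. This is a sketch, not Airflow code: a 
scheduler that cannot take a row's lock skips it rather than blocking, so 
each TaskInstance is claimed exactly once.

```python
import threading

# One lock per TaskInstance row; acquire(blocking=False) plays the role of
# SELECT ... FOR UPDATE SKIP LOCKED: if another scheduler already holds the
# row, skip it instead of waiting.
task_locks = {"ti-1": threading.Lock(), "ti-2": threading.Lock()}
claims = []  # (scheduler, ti_id) pairs actually claimed

def scheduler_pass(name):
    for ti_id, lock in task_locks.items():
        if lock.acquire(blocking=False):  # skip-locked semantics
            claims.append((name, ti_id))  # each row claimed at most once

threads = [threading.Thread(target=scheduler_pass, args=(f"sched-{i}",))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```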

For immediate mitigation: reduce the number of schedulers to two and verify 
that all schedulers use the same metadata database (a primary plus read 
replicas is fine, but every scheduler must write to the same primary).
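One way to check the "same primary" condition is to collect each scheduler 
host's configured connection URI (for example, with 
`airflow config get-value database sql_alchemy_conn` on each host) and 
compare the database endpoints. A hedged sketch, where the URIs below are 
made up for illustration:

```python
from urllib.parse import urlsplit

def same_primary(conn_uris):
    # Compare only host:port; credentials or driver strings may differ
    # per host while still pointing at the same primary.
    endpoints = {urlsplit(uri).netloc.rsplit("@", 1)[-1] for uri in conn_uris}
    return len(endpoints) == 1

# Hypothetical URIs gathered from two scheduler hosts:
uris = [
    "postgresql+psycopg2://airflow:pw@db-primary:5432/airflow",
    "postgresql+psycopg2://airflow:pw@db-primary:5432/airflow",
]
```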

GitHub link: 
https://github.com/apache/airflow/discussions/63249#discussioncomment-16112341
