sam-dumont commented on issue #59378:
URL: https://github.com/apache/airflow/issues/59378#issuecomment-4055748325

   We hit this on our prod cluster (830 DAGs, ~25k tasks/day, 2 schedulers, 
KubernetesExecutor, ~500 peak concurrent workers on EKS) and spent 2+ weeks 
instrumenting it. Sharing what we found in case it helps.
   
   We traced it to at least 2 independent race conditions, both unguarded 
UPDATEs on task instance state that can overwrite a RUNNING task :
   
   1. **`schedule_tis()` in `dagrun.py`**: a competing scheduler overwrites RUNNING → SCHEDULED. The worker heartbeats, gets a 409 with `current_state: scheduled`, and the task is killed. PR #60330 by @ephraimbuddy addresses this.
   
   2. **`ti_skip_downstream()` in the Execution API**: `BranchOperator` marks unchosen tasks as SKIPPED, but the UPDATE has no state guard, so it can overwrite tasks that are already RUNNING. The worker heartbeats, gets a 409 with `current_state: skipped`, and the task is killed. We opened PR #63266 for this one.
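
   Both bugs reduce to the same SQL anti-pattern: a state write whose WHERE clause doesn't check the current state. A minimal sketch of the fix pattern (plain `sqlite3` with a hypothetical `task_instance` table, not Airflow's actual ORM code):

```python
import sqlite3

# Model of the bug: a second process writing task state without
# checking what state the row is currently in.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (id TEXT PRIMARY KEY, state TEXT)")
conn.execute("INSERT INTO task_instance VALUES ('t1', 'running')")

# Unguarded UPDATE (the bug) would clobber the RUNNING task:
#   UPDATE task_instance SET state = 'skipped' WHERE id = 't1'

# Guarded UPDATE (the fix): the WHERE clause excludes states that
# must never be overwritten, so a concurrently RUNNING task survives.
cur = conn.execute(
    "UPDATE task_instance SET state = 'skipped' "
    "WHERE id = 't1' AND state NOT IN ('running', 'success', 'failed')"
)
print(cur.rowcount)  # 0 rows touched: the RUNNING task is left alone
```

   Both PRs follow this shape: the state transition only happens if the row is still in a state where the transition is legal.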
   
   Both seem to exist in every Airflow 3.x release with 2+ schedulers. The frequency likely scales with concurrency: more schedulers and more tasks = a wider race window.
   
   **What worked for us:**
   
   We patched both code paths at Docker build time (patching the installed `.py` files so every process gets the fix). Our [`apply_patches.py`](https://gist.github.com/sam-dumont/4bdc214a7673f6571f4fa7058153a43c) script does this (it auto-detects already-patched files and self-disables once the upstream fix lands). That took us from 374 errors/day to 0.
   
   One thing we learned the hard way: monkey patches in `airflow_local_settings.py` worked for `schedule_tis()` (which runs in the scheduler) but NOT for `ti_skip_downstream()`. That one runs on the API server (FastAPI/uvicorn), which has a different startup path and never imports `airflow_local_settings.py`. We had zero 409s for 18h, then `skipped` 409s exploded to 131/day when load increased. Build-time patching was the only approach that covered all processes.
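
   The core of the build-time approach is a small idempotent patcher, roughly this shape (a simplified sketch; the function name and behavior here are illustrative, the real script is in the gist above):

```python
from pathlib import Path


def apply_patch(path: Path, old: str, new: str) -> bool:
    """Rewrite an installed .py file in place. Returns True if patched.

    Idempotent: a file that already contains `new` is skipped, and if
    `old` is no longer present (upstream fix landed), it self-disables.
    """
    src = path.read_text()
    if new in src:       # already patched on a previous build
        return False
    if old not in src:   # upstream changed/fixed the code: do nothing
        return False
    path.write_text(src.replace(old, new))
    return True
```

   Because this rewrites the installed source rather than relying on import-time hooks, it applies to the scheduler, the API server, and every other process, which is exactly what `airflow_local_settings.py` could not guarantee.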
   
   **Diagnosing which vector you're hitting:**
   
   Parsing the `current_state` field from the 409 response body was the key to telling them apart:
   
   | `current_state` | What's happening | Relevant PR |
   |---|---|---|
   | `scheduled` | `schedule_tis()` race: RUNNING → SCHEDULED | PR #60330 |
   | `failed` | Cascade from the above | PR #60330 |
   | `skipped` | `ti_skip_downstream()` race: RUNNING → SKIPPED (branching DAGs) | PR #63266 |
   | `not_found` | K8s executor duplicate pod / stale UUID after scheduler restart | No fix yet (#57618) |
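
   The table is mechanical enough to automate in alerting. A sketch of that classification (the flat `{"current_state": ...}` body shape is an assumption on our part — adapt the parsing to however your Execution API serializes the 409 detail):

```python
import json

# Maps a 409 `current_state` to the likely race vector (table above).
RACE_VECTORS = {
    "scheduled": "schedule_tis() race: RUNNING -> SCHEDULED (PR #60330)",
    "failed": "cascade from the schedule_tis() race (PR #60330)",
    "skipped": "ti_skip_downstream() race: RUNNING -> SKIPPED (PR #63266)",
    "not_found": "duplicate pod / stale UUID after scheduler restart (#57618)",
}


def classify_409(body: str) -> str:
    """Classify a 409 response body by its current_state field."""
    state = json.loads(body).get("current_state")
    return RACE_VECTORS.get(state, f"unrecognized current_state: {state!r}")


print(classify_409('{"current_state": "skipped"}'))
# -> ti_skip_downstream() race: RUNNING -> SKIPPED (PR #63266)
```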
   
   **Our timeline** (2+ weeks of continuous monitoring):
   
   | Date | 409s | What changed |
   |---|---|---|
   | Feb 27-Mar 5 | 14-169 | baseline, no fixes |
   | Mar 6 | 22 | `schedule_tis` patch deployed: `scheduled`+`failed` dropped to 0 |
   | Mar 7-9 | 3-4 | only `skipped` remaining |
   | Mar 10 | 0 | both patches active |
   | **Mar 13** | **0** | **build-time patches on PROD: 0 race 409s, 0 orphaned tasks** |
   
   **Note if you're on 3.1.7 specifically:** on top of the race conditions above, 3.1.7 has a missing `task_reschedule` index and a `DepContext` mutation leak (#59604) that slow the scheduler, grow memory (ours peaked at 5.92/6 GiB), and widen the race window for both bugs. This created a feedback loop on our cluster where 409s escalated exponentially (5 → 26 → 133 → 374/day). Upgrading to 3.1.8 broke the loop (#60931, #62089), but the race conditions still need the patches on top.
   
   **Related PRs:**
   
   | What | PR | Status |
   |---|---|---|
   | `schedule_tis()` state guard | #60330 (@ephraimbuddy) | Open, validated on our prod |
   | `ti_skip_downstream()` state guard | #63266 (ours) | Open |
   | Missing index + DepContext leak (3.1.7 only) | #60931, #62089 | Fixed in 
3.1.8 |
   
   (AI-assisted investigation with Claude Code. All monitoring data from our 
prod cluster.)

