kaxil opened a new pull request, #67672:
URL: https://github.com/apache/airflow/pull/67672

   `TriggerRuleDep` sizes a task's upstream set with a `SELECT task_id, 
count(*) ... GROUP BY task_id`, and it only runs that when one of the upstreams 
is mapped. The problem: if several downstream tasks all depend on the same 
mapped upstream, each of them runs the identical query during the same 
scheduling pass. This caches the result on the `DepContext` for the duration of 
a pass, so it runs once per distinct upstream set instead of once per 
downstream.
   
   ## Impact
   
   The case that made me look was a DAG with one mapped upstream feeding 60 
downstream tasks, which fired ~180 of these queries per DAG run (the 60 
downstreams across roughly 3 `update_state` passes). That's one per pass now, 
so roughly 3 per run. It all happens in the per-run scheduling work, not the 
serialized critical section, so it's a latency improvement for mapping-heavy 
DAGs rather than a throughput change.
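
As a back-of-envelope check on those numbers, here is a minimal sketch. `queries_per_run` and its parameters are hypothetical names used only for illustration, and the pass count of 3 is the rough figure quoted above, not a measured constant:

```python
def queries_per_run(
    downstreams: int, passes: int, distinct_upstream_sets: int, cached: bool
) -> int:
    """Estimate count queries per DAG run (illustrative model only)."""
    # Without the cache every downstream issues its own query in each pass;
    # with it, each distinct upstream set is queried once per pass.
    per_pass = distinct_upstream_sets if cached else downstreams
    return per_pass * passes


before = queries_per_run(downstreams=60, passes=3, distinct_upstream_sets=1, cached=False)
after = queries_per_run(downstreams=60, passes=3, distinct_upstream_sets=1, cached=True)
print(before, after)  # 180 3
```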
   
   ## How it works
   
   - Key is `(dag_id, run_id, frozenset(upstream_task_ids))`, stored on 
`DepContext`, which is already built once per `_get_ready_tis` pass and reused 
for every TI in that pass (the same object that caches `finished_tis`).
   - It only kicks in when the task isn't inside a mapped task group. That's 
the branch where the predicate is plain `task_id IN (upstream_ids)` and comes 
out the same for every downstream with the same upstreams. Tasks inside a 
mapped task group have per map-index predicates, so they keep running their own 
query.
   - `upstream_setup` is still summed in the caller from the cached rows, so 
the setup count stays right per downstream.
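
The points above can be sketched roughly as follows. This is a toy model, not the code in this PR: `ToyDepContext`, `cached_upstream_counts`, `setup_total`, and `fake_query` are hypothetical names, and the query is replaced by an in-memory lookup so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

CacheKey = Tuple[str, str, FrozenSet[str]]  # (dag_id, run_id, upstream ids)
Counts = Dict[str, int]  # task_id -> number of task instances


@dataclass
class ToyDepContext:
    """Hypothetical stand-in for a per-pass context; not Airflow's DepContext."""

    upstream_counts: Dict[CacheKey, Counts] = field(default_factory=dict)


def cached_upstream_counts(
    ctx: ToyDepContext,
    dag_id: str,
    run_id: str,
    upstream_task_ids: Iterable[str],
    run_query: Callable[[FrozenSet[str]], Counts],
) -> Counts:
    # frozenset makes the key hashable and independent of upstream ordering.
    ids = frozenset(upstream_task_ids)
    key = (dag_id, run_id, ids)
    if key not in ctx.upstream_counts:
        ctx.upstream_counts[key] = run_query(ids)
    return ctx.upstream_counts[key]


def setup_total(counts: Counts, setup_task_ids: Iterable[str]) -> int:
    """Sum the setup count in the caller from the shared cached rows."""
    return sum(counts.get(task_id, 0) for task_id in setup_task_ids)


# Fake query that records every call so the saving is visible.
calls: List[FrozenSet[str]] = []
table = {"mapped_up": 3, "setup_up": 1}


def fake_query(ids: FrozenSet[str]) -> Counts:
    calls.append(ids)
    return {t: table[t] for t in ids if t in table}


ctx = ToyDepContext()
for _ in range(60):  # 60 downstreams sharing the same upstream set
    counts = cached_upstream_counts(
        ctx, "dag", "run_1", ["setup_up", "mapped_up"], fake_query
    )

print(len(calls))  # 1
```

Because the cached value holds the raw per-task rows rather than a pre-summed total, each caller can still derive its own figures from it, which is what `setup_total` is meant to illustrate.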
   
   ## Staleness
   
   The cache follows the same per-pass snapshot model as `finished_tis`, and I clear it in 
`_get_ready_tis` on expansion, since that's when a mapped task's instance count 
actually changes mid-pass. The revise-map-index path can change counts too, but 
it only adds not-yet-finished instances, and an unfinished upstream keeps the 
count-based rules from going ready, so a stale value there doesn't change the 
outcome.
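
To illustrate why the cache is cleared on expansion, here is a small self-contained sketch. All names (`instances`, `count_query`, `lookup`, `expand`) are hypothetical, and the toy dictionary stands in for the task-instance table; the real invalidation happens in `_get_ready_tis` as described above.

```python
from typing import Dict, FrozenSet, Tuple

# Toy "database": task_id -> current number of task instances.
instances: Dict[str, int] = {"mapped_up": 1}
query_count = 0

cache: Dict[Tuple[str, str, FrozenSet[str]], Dict[str, int]] = {}


def count_query(ids: FrozenSet[str]) -> Dict[str, int]:
    """Stand-in for the GROUP BY count query; tracks how often it runs."""
    global query_count
    query_count += 1
    return {t: instances[t] for t in ids if t in instances}


def lookup(dag_id: str, run_id: str, ids: FrozenSet[str]) -> Dict[str, int]:
    key = (dag_id, run_id, ids)
    if key not in cache:
        cache[key] = count_query(ids)
    return cache[key]


def expand(task_id: str, length: int) -> None:
    """Pretend a mapped task was expanded mid-pass, then drop cached counts."""
    instances[task_id] = length
    cache.clear()  # without this, later lookups would still report the old count


ids = frozenset({"mapped_up"})
before = lookup("dag", "run_1", ids)["mapped_up"]
expand("mapped_up", 3)
after = lookup("dag", "run_1", ids)["mapped_up"]
print(before, after, query_count)  # 1 3 2
```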
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
