kaxil opened a new pull request, #67672: URL: https://github.com/apache/airflow/pull/67672
`TriggerRuleDep` sizes a task's upstream set with a `SELECT task_id, count(*) ... GROUP BY task_id`, and it only runs that when one of the upstreams is mapped. The problem: if several downstream tasks all depend on the same mapped upstream, each of them runs the identical query during the same scheduling pass. This caches the result on the `DepContext` for the duration of a pass, so it runs once per distinct upstream set instead of once per downstream.

## Impact

The case that made me look was a DAG with one mapped upstream feeding 60 downstream tasks, which fired ~180 of these per DAG run (the 60 downstreams across roughly 3 `update_state` passes). That's one per pass now. It all happens in the per-run scheduling work, not the serialized critical section, so it's a latency thing for mapping-heavy DAGs rather than a throughput change.

## How it works

- Key is `(dag_id, run_id, frozenset(upstream_task_ids))`, stored on `DepContext`, which is already built once per `_get_ready_tis` pass and reused for every TI in that pass (the same object that caches `finished_tis`).
- It only kicks in when the task isn't inside a mapped task group. That's the branch where the predicate is plain `task_id IN (upstream_ids)` and comes out the same for every downstream with the same upstreams. Tasks inside a mapped task group have per map-index predicates, so they keep running their own query.
- `upstream_setup` is still summed in the caller from the cached rows, so the setup count stays right per downstream.

## Staleness

It follows `finished_tis` (a per-pass snapshot), and I clear it in `_get_ready_tis` on expansion, since that's when a mapped task's instance count actually changes mid-pass. The revise-map-index path can change counts too, but it only adds not-yet-finished instances, and an unfinished upstream keeps the count-based rules from going ready, so a stale value there doesn't change the outcome.
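For readers who want the shape of the idea without opening the diff, here is a minimal, self-contained sketch of a per-pass cache keyed the way the description says. It is not the code in this PR: `PassContext`, `FakeDb`, `get_counts`, `invalidate`, and `simulate` are hypothetical names made up for illustration, and the "query" is an in-memory counter rather than a real `SELECT`.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass, field

# Hypothetical stand-ins for illustration only; not Airflow's real API.
CacheKey = tuple[str, str, frozenset[str]]
CountRows = dict[str, int]  # task_id -> number of task instances


@dataclass
class PassContext:
    """Toy analogue of a context object that lives for one scheduling pass."""

    upstream_counts: dict[CacheKey, CountRows] = field(default_factory=dict)

    def get_counts(
        self,
        dag_id: str,
        run_id: str,
        upstream_task_ids: Iterable[str],
        run_query: Callable[[frozenset[str]], CountRows],
    ) -> CountRows:
        # frozenset makes the key hashable and independent of upstream order.
        key = (dag_id, run_id, frozenset(upstream_task_ids))
        if key not in self.upstream_counts:
            self.upstream_counts[key] = run_query(key[2])
        return self.upstream_counts[key]

    def invalidate(self) -> None:
        # Assumed hook: call when counts can change mid-pass (e.g. expansion).
        self.upstream_counts.clear()


class FakeDb:
    """Counts how often the grouped count 'query' is executed."""

    def __init__(self, rows: CountRows) -> None:
        self.rows = rows
        self.queries = 0

    def count_by_task(self, task_ids: frozenset[str]) -> CountRows:
        self.queries += 1
        return {t: n for t, n in self.rows.items() if t in task_ids}


def simulate(downstreams: int, passes: int) -> int:
    """Return how many queries run when every downstream shares one upstream."""
    db = FakeDb({"mapped_upstream": 5})
    for _ in range(passes):
        ctx = PassContext()  # a fresh context per pass, so a fresh cache
        for _ in range(downstreams):
            ctx.get_counts("dag", "run_1", ["mapped_upstream"], db.count_by_task)
    return db.queries
```

Under these assumptions, `simulate(60, 3)` issues 3 queries, where running the query once per downstream would issue 60 × 3 = 180, which matches the numbers in the Impact section.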
