hkc-8010 opened a new issue, #63975:
URL: https://github.com/apache/airflow/issues/63975

   ### Apache Airflow version
   
   3.x (FastAPI grid API). Observed concretely on deployments running the Grid 
UI against `GET /grid/ti_summaries/{dag_id}` and related grid endpoints.
   
   ### What happened?
   
   On real-world DAGs that combine **deep TaskGroups** with **very large 
dynamic task mapping** (thousands to tens of thousands of task instances **per 
DAG run**), the **API server** can exhibit:
   
   - **Very high memory and CPU** while serving the Grid UI
   - **HTTP 500** / **ASGI exceptions** on `GET 
.../ui/grid/ti_summaries/{dag_id}/...`
   - Under Kubernetes: **OOMKilled (exit 137)** on the **apiserver** container 
when memory limits are moderate (e.g. 2 GiB), which in turn surfaces as **no 
healthy upstream** behind ingress
   
   This occurs even when the **logical DAG "task" count** (operators + groups) 
is modest, because **metadata row count** is dominated by **mapped instances**.
   
   Related prior reports focused on **many dag runs** and/or **~O(100) 
structural tasks** (e.g. #57776, #50928). This issue highlights that **per-run 
task instance cardinality** from **mapping** can push the **same backend 
endpoints** into an even worse scaling regime.
   
   ### What you think should happen instead?
   
   - Grid-related API endpoints should **degrade gracefully** (bounded 
memory/CPU, optional pagination/streaming chunks, or documented hard limits 
with clear errors) for DAG runs with **very large** `task_instance` cardinality.
   - Ideally: **do not load the full TI set for a run** into a single 
request/response path unless the client explicitly requests it (e.g. 
pagination, cursor, or "summary only" without per-map-index detail expansion 
where not needed).
   
   ### Root cause analysis (backend)
   
   In `airflow-core` FastAPI grid routes:
   
   1. **`GET /grid/ti_summaries/{dag_id}`** (`get_grid_ti_summaries_stream`) 
executes, **for each `run_id`**, a query that returns **all** matching 
`TaskInstance` rows for that `(dag_id, run_id)` with **no server-side limit**, 
then builds summaries in Python and emits NDJSON.
   
      File: `airflow-core/src/airflow/api_fastapi/core_api/routes/ui/grid.py`  
      Function: `get_grid_ti_summaries_stream` → `_build_ti_summaries`
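   A minimal pure-Python sketch of the difference (the function names are hypothetical, not the actual `grid.py` code): the reported shape materializes every TI row for the run at once, so peak memory scales with per-run cardinality, while a chunked variant bounds peak memory and lets NDJSON summaries be built and streamed incrementally.

   ```python
   def fetch_all_tis(run_id, ti_store):
       # Unbounded shape: materializes every row for the run in one list.
       return [ti for ti in ti_store if ti["run_id"] == run_id]

   def fetch_tis_chunked(run_id, ti_store, chunk_size=1000):
       # Bounded alternative: yield fixed-size chunks so summaries can be
       # aggregated incrementally and emitted as NDJSON lines as they complete.
       chunk = []
       for ti in ti_store:
           if ti["run_id"] != run_id:
               continue
           chunk.append(ti)
           if len(chunk) >= chunk_size:
               yield chunk
               chunk = []
       if chunk:
           yield chunk

   # 2,500 mapped instances for a single run.
   store = [{"run_id": "r1", "task_id": "t", "map_index": i} for i in range(2500)]
   chunks = list(fetch_tis_chunked("r1", store, chunk_size=1000))
   ```

   With a real database session the chunked variant would translate to `LIMIT`/keyset queries rather than an in-memory filter, but the memory profile difference is the same.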
   
   2. **`_build_ti_summaries` + `_find_aggregates`** (in 
`core_api/services/ui/grid.py`) walk the serialized DAG and aggregate 
mapped/task-group state. For **mapped** operators, aggregation materializes 
**lists of per-instance details**; **task groups** roll up **`details` from 
children**, which grows with the number of mapped instances under the subtree. 
That implies **CPU and temporary allocations** scale with **TI count × DAG 
structure**, even before JSON serialization.
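   To illustrate point 2 in isolation (both helpers are hypothetical stand-ins for the `details` rollup described above, not the real `_find_aggregates`): a per-instance `details` list grows linearly with mapping width, while an aggregate-only rollup keeps output size independent of it.

   ```python
   from collections import Counter

   def aggregate_details(instances):
       # Pattern described in the report: a per-instance `details` list whose
       # length tracks the number of mapped instances under the subtree.
       return [{"map_index": ti["map_index"], "state": ti["state"]} for ti in instances]

   def aggregate_counts(instances):
       # Aggregate-only alternative: output is O(number of distinct states),
       # regardless of how wide the mapping is.
       return dict(Counter(ti["state"] for ti in instances))

   # 10,000 mapped instances, every tenth one failed.
   tis = [
       {"map_index": i, "state": "failed" if i % 10 == 0 else "success"}
       for i in range(10_000)
   ]
   details = aggregate_details(tis)
   counts = aggregate_counts(tis)
   ```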
   
   3. **`GET /grid/runs/{dag_id}`** uses `selectinload(DagRun.task_instances)` 
(and `task_instances_histories`) for each `DagRun` in the **paginated** run 
list. Default API `limit` is modest (`fallback_page_limit` / 
`maximum_page_limit`), but **each** run in that page can still attach **every** 
`TaskInstance` row (at least for version/bundle resolution), i.e. 
**O(limit_runs × TIs_per_run)** ORM rows loaded for one grid request.
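   The row-count effect of point 3 can be sketched with plain SQL (the schema and column names are simplified stand-ins for the real models): the eager-load shape pulls back `limit_runs × TIs_per_run` rows, while a targeted query for version metadata returns one row per run.

   ```python
   import sqlite3

   conn = sqlite3.connect(":memory:")
   conn.executescript("""
   CREATE TABLE dag_run (run_id TEXT PRIMARY KEY);
   CREATE TABLE task_instance (run_id TEXT, map_index INT, dag_version TEXT);
   """)
   # One "page" of 20 runs, each with 2,000 mapped task instances.
   conn.executemany("INSERT INTO dag_run VALUES (?)", [(f"r{i}",) for i in range(20)])
   conn.executemany(
       "INSERT INTO task_instance VALUES (?, ?, ?)",
       [(f"r{i}", m, "v1") for i in range(20) for m in range(2000)],
   )

   # Eager-load shape: every TI row for every run in the page comes back.
   eager = conn.execute(
       "SELECT ti.run_id, ti.map_index FROM dag_run dr "
       "JOIN task_instance ti ON ti.run_id = dr.run_id"
   ).fetchall()

   # Targeted shape: one row per (run, version) when only version/bundle
   # metadata is actually needed by the response.
   slim = conn.execute(
       "SELECT DISTINCT run_id, dag_version FROM task_instance"
   ).fetchall()
   ```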
   
   Together, a few UI actions (opening the grid, refreshing, multiple concurrent 
users) can drive ORM work equivalent to millions of rows for DAGs whose per-run 
width is dominated by mapping.
   
   ### How to reproduce
   
   1. Create a DAG with **nested TaskGroups** and at least one **large** 
`expand` / `expand_kwargs` (or multiple mapped branches) so a **single** 
`dagrun` has **≥ 5,000** `task_instance` rows (higher is worse).
   2. Trigger a run and open the **Grid** view for that `dag_id` / run (or use 
the REST/UI calls that hit `ti_summaries` and `grid/runs`).
   3. Observe API server **RSS growth**, **latency**, **500s**, and/or **OOM** 
under realistic pod limits.
   
   (Internal load tests could also call the public grid endpoints directly with 
a generated metadata fixture to avoid sharing customer DAGs.)
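   A sketch of such a generated fixture (the helper is entirely hypothetical, not part of Airflow): it emits one synthetic `task_instance` record per mapped index, nested under `group_*` prefixes, mimicking deep TaskGroups combined with a wide `expand()`.

   ```python
   def generate_ti_fixture(run_id, groups=3, tasks_per_group=2, map_width=1000):
       # Synthetic TI metadata: deeper groups get longer dotted prefixes,
       # and every task carries `map_width` mapped indices.
       rows = []
       for g in range(groups):
           prefix = ".".join(f"group_{d}" for d in range(g + 1))
           for t in range(tasks_per_group):
               for m in range(map_width):
                   rows.append({
                       "run_id": run_id,
                       "task_id": f"{prefix}.task_{t}",
                       "map_index": m,
                       "state": "success",
                   })
       return rows

   # 3 groups x 2 tasks x 1,000 indices = 6,000 rows, above the
   # 5,000-per-run threshold from step 1.
   fixture = generate_ti_fixture("r1")
   ```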
   
   ### Suggested directions (not prescriptive)
   
   - **Paginate or chunk** TI fetches for `ti_summaries` (by task_id prefix, 
task group subtree, map index range, or cursor).
   - For **`/grid/runs`**, avoid loading **all** `task_instances` for every run 
in the page when the response only needs **dag version / bundle** metadata—use 
targeted queries or a slimmer loader.
   - In `_find_aggregates`, consider **not** building full **`details`** lists 
for large mapped subtrees when the UI contract allows **aggregate-only** nodes 
(or cap detail depth with explicit "partial" flags).
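   A sketch of the map-index-range/cursor idea (offset-based here for brevity; a real implementation would more likely use keyset pagination with `WHERE map_index >= :cursor ORDER BY map_index LIMIT :page_size` so no request materializes the full mapped set):

   ```python
   def page_summaries(instances, cursor=0, page_size=1000):
       # `instances` assumed pre-sorted by map_index. Returns one bounded
       # window plus the cursor for the next request, or None when exhausted.
       window = instances[cursor:cursor + page_size]
       next_cursor = cursor + page_size if cursor + page_size < len(instances) else None
       return window, next_cursor

   all_tis = [{"map_index": i, "state": "success"} for i in range(2500)]

   # Client loop: each round trip holds at most `page_size` rows server-side.
   pages, cursor = [], 0
   while cursor is not None:
       window, cursor = page_summaries(all_tis, cursor)
       pages.append(window)
   ```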
   
   ### Related issues
   
   - #50928 — Grid view scaling / pagination discussion  
   - #57776 — Grid performance with many runs and ~180–200 tasks (closed)  
   - #44685 — Mapped task + grid UI crash (may overlap UX side)  
   - #58510 — API server DB connection behavior under load  
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's Code of Conduct
   

