hkc-8010 opened a new issue, #63975:
URL: https://github.com/apache/airflow/issues/63975
### Apache Airflow version
3.x (FastAPI grid API). Observed concretely on deployments running the Grid
UI against `GET /grid/ti_summaries/{dag_id}` and related grid endpoints.
### What happened?
On real-world DAGs that combine **deep TaskGroups** with **very large
dynamic task mapping** (thousands to tens of thousands of task instances **per
DAG run**), the **API server** can exhibit:
- **Very high memory and CPU** while serving the Grid UI
- **HTTP 500** / **ASGI exceptions** on
`GET .../ui/grid/ti_summaries/{dag_id}/...`
- Under Kubernetes: **OOMKilled (exit 137)** on the **apiserver** container
when memory limits are moderate (e.g. 2 GiB), which in turn surfaces as **no
healthy upstream** behind ingress
This occurs even when the **logical DAG "task" count** (operators + groups)
is modest, because **metadata row count** is dominated by **mapped instances**.
Related prior reports focused on **many dag runs** and/or **~O(100)
structural tasks** (e.g. #57776, #50928). This issue highlights that **per-run
task instance cardinality** from **mapping** can push the **same backend
endpoints** into an even worse scaling regime.
### What you think should happen instead?
- Grid-related API endpoints should **degrade gracefully** (bounded
memory/CPU, optional pagination/streaming chunks, or documented hard limits
with clear errors) for DAG runs with **very large** `task_instance` cardinality.
- Ideally: **do not load the full TI set for a run** into a single
request/response path unless the client explicitly requests it (e.g.
pagination, cursor, or "summary only" without per-map-index detail expansion
where not needed).
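As a rough illustration of the "documented hard limits with clear errors" option, the sketch below checks a task instance count against a budget before any rows are loaded. The exception class, the function name, and the `max_tis` parameter are all hypothetical; none of them exist in Airflow, and a real endpoint would presumably translate the error into an appropriate HTTP response.

```python
class TaskInstanceLimitExceeded(Exception):
    """Hypothetical error type; not an existing Airflow exception."""


def check_ti_budget(dag_id: str, run_id: str, ti_count: int, max_tis: int) -> None:
    """Raise a descriptive error instead of loading an oversized TI set."""
    if ti_count > max_tis:
        raise TaskInstanceLimitExceeded(
            f"Run {run_id!r} of DAG {dag_id!r} has {ti_count} task instances, "
            f"above the configured grid limit of {max_tis}; "
            "request a narrower summary instead."
        )


# Hypothetical usage: ti_count would come from a cheap COUNT(*) query,
# issued before the full TaskInstance rows are fetched.
check_ti_budget("example_dag", "manual_run_1", ti_count=800, max_tis=10_000)
```

The point of the design is that the count is cheap to obtain, so the server can refuse or narrow the request before paying for ORM hydration.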
### Root cause analysis (backend)
In `airflow-core` FastAPI grid routes:
1. **`GET /grid/ti_summaries/{dag_id}`** (`get_grid_ti_summaries_stream`)
executes, **for each `run_id`**, a query that returns **all** matching
`TaskInstance` rows for that `(dag_id, run_id)` with **no server-side limit**,
then builds summaries in Python and emits NDJSON.
File: `airflow-core/src/airflow/api_fastapi/core_api/routes/ui/grid.py`
Function: `get_grid_ti_summaries_stream` → `_build_ti_summaries`
2. **`_build_ti_summaries` + `_find_aggregates`** (in
`core_api/services/ui/grid.py`) walk the serialized DAG and aggregate
mapped/task-group state. For **mapped** operators, aggregation materializes
**lists of per-instance details**; **task groups** roll up **`details` from
children**, which grows with the number of mapped instances under the subtree.
That implies **CPU and temporary allocations** scale with **TI count × DAG
structure**, even before JSON serialization.
3. **`GET /grid/runs/{dag_id}`** uses `selectinload(DagRun.task_instances)`
(and `task_instances_histories`) for each `DagRun` in the **paginated** run
list. Default API `limit` is modest (`fallback_page_limit` /
`maximum_page_limit`), but **each** run in that page can still attach **every**
`TaskInstance` row (at least for version/bundle resolution), i.e.
**O(limit_runs × TIs_per_run)** ORM rows loaded for one grid request.
Together, a few UI actions (opening the grid, refreshing it, several users
doing so at once) can drive ORM work equivalent to loading **millions of
rows** for DAGs whose **run width** is dominated by mapping.
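To make the **O(limit_runs × TIs_per_run)** term concrete, here is a back-of-envelope estimate. The inputs (a page of 50 runs, 20,000 TIs per run, five concurrent requests) are illustrative assumptions, not measurements and not Airflow defaults.

```python
def estimated_orm_rows(
    limit_runs: int, tis_per_run: int, concurrent_requests: int = 1
) -> int:
    """Rough upper bound on TaskInstance rows hydrated by grid/runs requests."""
    return limit_runs * tis_per_run * concurrent_requests


# Illustrative inputs only (assumptions, not measured values).
single = estimated_orm_rows(limit_runs=50, tis_per_run=20_000)
burst = estimated_orm_rows(limit_runs=50, tis_per_run=20_000, concurrent_requests=5)
print(f"one request: {single:,} rows; five concurrent requests: {burst:,} rows")
```

Under these assumptions a single grid request already touches about one million rows, and a handful of overlapping requests multiplies that linearly.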
### How to reproduce
1. Create a DAG with **nested TaskGroups** and at least one **large**
`expand` / `expand_kwargs` (or multiple mapped branches) so a **single**
`dagrun` has **≥ 5,000** `task_instance` rows (higher is worse).
2. Trigger a run and open the **Grid** view for that `dag_id` / run (or use
the REST/UI calls that hit `ti_summaries` and `grid/runs`).
3. Observe API server **RSS growth**, **latency**, **500s**, and/or **OOM**
under realistic pod limits.
(Internal load tests could also call the public grid endpoints directly with
a generated metadata fixture to avoid sharing customer DAGs.)
### Suggested directions (not prescriptive)
- **Paginate or chunk** TI fetches for `ti_summaries` (by task_id prefix,
task group subtree, map index range, or cursor).
- For **`/grid/runs`**, avoid loading **all** `task_instances` for every run
in the page when the response only needs **dag version / bundle** metadata;
use targeted queries or a slimmer loader instead.
- In `_find_aggregates`, consider **not** building full **`details`** lists
for large mapped subtrees when the UI contract allows **aggregate-only** nodes
(or cap detail depth with explicit "partial" flags).
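The first and third directions can be combined in a minimal sketch: fetch task instances in bounded chunks using a cursor, and keep only per-task state counts rather than per-instance detail lists. This runs against an in-memory SQLite table with a simplified, hypothetical schema (`ti`); it is not Airflow's metadata schema or the actual `_find_aggregates` logic, and the function names are invented for illustration.

```python
import sqlite3
from collections import Counter, defaultdict


def build_fixture(conn: sqlite3.Connection, n_mapped: int) -> None:
    """Create a toy table: one unmapped task plus n_mapped mapped instances."""
    conn.execute(
        "CREATE TABLE ti (task_id TEXT, map_index INTEGER, state TEXT, "
        "PRIMARY KEY (task_id, map_index))"
    )
    rows = [("group_a.prepare", -1, "success")]  # unmapped TIs use map_index -1
    rows += [
        ("group_a.work", i, "failed" if i % 100 == 0 else "success")
        for i in range(n_mapped)
    ]
    conn.executemany("INSERT INTO ti VALUES (?, ?, ?)", rows)


def iter_ti_chunks(conn: sqlite3.Connection, chunk_size: int):
    """Keyset (cursor) pagination ordered by (task_id, map_index)."""
    last_task, last_index = "", -2  # sorts before every real row
    while True:
        rows = conn.execute(
            "SELECT task_id, map_index, state FROM ti "
            "WHERE task_id > ? OR (task_id = ? AND map_index > ?) "
            "ORDER BY task_id, map_index LIMIT ?",
            (last_task, last_task, last_index, chunk_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_task, last_index = rows[-1][0], rows[-1][1]


def aggregate_states(conn: sqlite3.Connection, chunk_size: int = 1000):
    """Aggregate-only rollup: state counts per task_id, no detail lists."""
    counts = defaultdict(Counter)
    peak_rows_held = 0
    for chunk in iter_ti_chunks(conn, chunk_size):
        peak_rows_held = max(peak_rows_held, len(chunk))
        for task_id, _map_index, state in chunk:
            counts[task_id][state] += 1
    return counts, peak_rows_held


conn = sqlite3.connect(":memory:")
build_fixture(conn, n_mapped=5000)
counts, peak = aggregate_states(conn, chunk_size=1000)
print(dict(counts["group_a.work"]), "peak rows held at once:", peak)
```

With this shape, peak memory is bounded by `chunk_size` and the number of distinct task IDs rather than by the number of mapped instances. Whether the Grid UI contract can accept counts in place of per-instance details is the open question raised in the last bullet.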
### Related issues
- #50928 — Grid view scaling / pagination discussion
- #57776 — Grid performance with many runs and ~180–200 tasks (closed)
- #44685 — Mapped task + grid UI crash (may overlap UX side)
- #58510 — API server DB connection behavior under load
### Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's Code of Conduct