Re: [PR] Optimize TaskInstance list query by eliminating duplicate joins [airflow]

via GitHub Wed, 18 Feb 2026 02:53:57 -0800


Arunodoy18 commented on PR #62108:
URL: https://github.com/apache/airflow/pull/62108#issuecomment-3920114092


   Thanks for the review! I've added a comprehensive SQL benchmark script to 
demonstrate the optimization and make verification straightforward.
   
   ---
   
   ### 📊 Benchmark Script Added
   
   **Location:** 
[`scripts/tools/benchmark_ti_query.py`](https://github.com/Arunodoy18/airflow/blob/fix/taskinstance-query-performance-62027/scripts/tools/benchmark_ti_query.py)
 (commit [`8e982ad`](https://github.com/Arunodoy18/airflow/commit/8e982ad677))
   
   **Capabilities:**
   - ✅ Uses **real Airflow ORM models** (TaskInstance, DagRun, DagVersion, etc.)
   - ✅ Captures actual SQL via temporary `sqlalchemy.event` hooks (zero 
production code changes)
   - ✅ Builds both BEFORE (joinedload) and AFTER (contains_eager) query variants
   - ✅ Benchmarks with `time.perf_counter` (configurable warmup + iterations)
   - ✅ Optional `EXPLAIN ANALYZE` for PostgreSQL/MySQL
   - ✅ Auto-seeds synthetic dataset if DB is empty
   - ✅ Generates GitHub-ready markdown report
   
   ---
   
   ### 🚀 How to Run
   
   **Inside Breeze (PostgreSQL backend):**
   
   ```bash
   breeze shell --backend postgres
   
   # Full benchmark with dataset matching issue sample load
   python scripts/tools/benchmark_ti_query.py --num-dags 20 --tis-per-dag 200 
-n 50 --explain
   
   # Quick SQL-only preview (instant, no benchmark needed)
   python scripts/tools/benchmark_ti_query.py --sql-only
   ```
   
   **Save outputs for documentation:**
   
   ```bash
   python scripts/tools/benchmark_ti_query.py \
       --num-dags 20 --tis-per-dag 200 -n 50 \
       -o benchmark_results.json \
       --sql-before sql_before.txt \
       --sql-after sql_after.txt \
       --markdown-file pr_benchmark_report.md
   ```
   
   ---
   
   ### 🔍 What This PR Fixes
   
   **Problem:**  
   All three TaskInstance list endpoints (`get_task_instances`, 
`get_mapped_task_instances`, `get_task_instances_batch`) explicitly join 
`dag_run` and `dag_version` tables for filtering/sorting operations:
   
   ```python
   query = (
       select(TI)
       .join(TI.dag_run)           # ← Explicit join for filtering
       .outerjoin(TI.dag_version)  # ← Explicit join for sorting
       .options(...)
   )
   ```
   
   But then `eager_load_TI_and_TIH_for_validation()` uses `joinedload()` to 
eagerly load these **same relationships**, causing SQLAlchemy to emit 
**duplicate LEFT OUTER JOINs**:
   
   ```python
   # BEFORE (baseline — main branch)
   options = (
       joinedload(TI.dag_version).joinedload(DagVersion.bundle),  # ← Duplicate 
JOIN!
       joinedload(TI.dag_run).options(joinedload(DagRun.dag_model)),  # ← 
Duplicate JOIN!
       joinedload(TI.task_instance_note),
   )
   ```
   
   This results in queries with **2-4 redundant JOINs** depending on the 
endpoint.
   
   ---
   
   **Solution:**  
   Added `dag_run_joined` and `dag_version_joined` keyword-only flags to the 
helper function. When `True`, uses `contains_eager()` instead of `joinedload()`:
   
   ```python
   # AFTER (optimized — this PR)
   if dag_run_joined:
       dag_run_opt = 
contains_eager(orm_model.dag_run).options(joinedload(DagRun.dag_model))
   else:
       dag_run_opt = 
joinedload(orm_model.dag_run).options(joinedload(DagRun.dag_model))
   ```
   
   `contains_eager()` tells SQLAlchemy: *"This relationship is already joined — 
populate it from the existing columns instead of adding a new JOIN."*
   
   ---
   
   **Impact:**
   
   | Endpoint | Duplicate JOINs Eliminated |
   |----------|---------------------------|
   | `get_task_instances` | **2** (dag_run, dag_version) |
   | `get_mapped_task_instances` | **2** (dag_run, dag_version) |
   | `get_task_instances_batch` | **4** (triple-joined — had redundant 
post-paginate options) |
   
   **Backward Compatibility:**  
   Both flags default to `False`, preserving existing behavior for callers that 
don't explicitly join (e.g., TaskInstanceHistory endpoints).
   
   ---
   
   ### 📝 Testing
   
   Added 2 new test cases in commit 
[`1b62ddf`](https://github.com/Arunodoy18/airflow/commit/1b62ddf86d):
   
   1. **`test_relationship_fields_loaded_after_query_optimization`** — Verifies 
all relationship fields (`logical_date`, `run_after`, `dag_display_name`, 
`dag_version`, `note`) are still properly loaded in `get_task_instances` 
response
   2. **`test_batch_relationship_fields_loaded_after_query_optimization`** — 
Verifies same for `get_task_instances_batch` including `rendered_fields`
   
   Both tests confirm that switching to `contains_eager` doesn't break response 
serialization.
   
   ---
   
   ### ⚠️ Note
   
   I'm currently on Windows and can't run Breeze locally. If you could run the 
benchmark script to verify the SQL reduction and performance improvement, that 
would be greatly appreciated! 🙏
   
   The `--sql-only` flag provides an instant preview of the query structure 
difference without needing to seed data or run benchmarks:
   
   ```bash
   breeze shell
   python scripts/tools/benchmark_ti_query.py --sql-only
   ```
   
   This will print both the BEFORE and AFTER SQL queries side-by-side for 
manual inspection.
   
   Alternatively, I'm happy to provide manual SQL captures from code inspection 
if that's more convenient.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Optimize TaskInstance list query by eliminating duplicate joins [airflow]

Reply via email to