Arunodoy18 commented on PR #62108: URL: https://github.com/apache/airflow/pull/62108#issuecomment-3920114092
Thanks for the review! I've added a comprehensive SQL benchmark script to demonstrate the optimization and make verification straightforward. --- ### 📊 Benchmark Script Added **Location:** [`scripts/tools/benchmark_ti_query.py`](https://github.com/Arunodoy18/airflow/blob/fix/taskinstance-query-performance-62027/scripts/tools/benchmark_ti_query.py) (commit [`8e982ad`](https://github.com/Arunodoy18/airflow/commit/8e982ad677)) **Capabilities:** - ✅ Uses **real Airflow ORM models** (TaskInstance, DagRun, DagVersion, etc.) - ✅ Captures actual SQL via temporary `sqlalchemy.event` hooks (zero production code changes) - ✅ Builds both BEFORE (joinedload) and AFTER (contains_eager) query variants - ✅ Benchmarks with `time.perf_counter` (configurable warmup + iterations) - ✅ Optional `EXPLAIN ANALYZE` for PostgreSQL/MySQL - ✅ Auto-seeds synthetic dataset if DB is empty - ✅ Generates GitHub-ready markdown report --- ### 🚀 How to Run **Inside Breeze (PostgreSQL backend):** ```bash breeze shell --backend postgres # Full benchmark with dataset matching issue sample load python scripts/tools/benchmark_ti_query.py --num-dags 20 --tis-per-dag 200 -n 50 --explain # Quick SQL-only preview (instant, no benchmark needed) python scripts/tools/benchmark_ti_query.py --sql-only ``` **Save outputs for documentation:** ```bash python scripts/tools/benchmark_ti_query.py \ --num-dags 20 --tis-per-dag 200 -n 50 \ -o benchmark_results.json \ --sql-before sql_before.txt \ --sql-after sql_after.txt \ --markdown-file pr_benchmark_report.md ``` --- ### 🔍 What This PR Fixes **Problem:** All three TaskInstance list endpoints (`get_task_instances`, `get_mapped_task_instances`, `get_task_instances_batch`) explicitly join `dag_run` and `dag_version` tables for filtering/sorting operations: ```python query = ( select(TI) .join(TI.dag_run) # ← Explicit join for filtering .outerjoin(TI.dag_version) # ← Explicit join for sorting .options(...) ) ``` But then `eager_load_TI_and_TIH_for_validation()` uses `joinedload()` to eagerly load these **same relationships**, causing SQLAlchemy to emit **duplicate LEFT OUTER JOINs**: ```python # BEFORE (baseline — main branch) options = ( joinedload(TI.dag_version).joinedload(DagVersion.bundle), # ← Duplicate JOIN! joinedload(TI.dag_run).options(joinedload(DagRun.dag_model)), # ← Duplicate JOIN! joinedload(TI.task_instance_note), ) ``` This results in queries with **2-4 redundant JOINs** depending on the endpoint. --- **Solution:** Added `dag_run_joined` and `dag_version_joined` keyword-only flags to the helper function. When `True`, uses `contains_eager()` instead of `joinedload()`: ```python # AFTER (optimized — this PR) if dag_run_joined: dag_run_opt = contains_eager(orm_model.dag_run).options(joinedload(DagRun.dag_model)) else: dag_run_opt = joinedload(orm_model.dag_run).options(joinedload(DagRun.dag_model)) ``` `contains_eager()` tells SQLAlchemy: *"This relationship is already joined — populate it from the existing columns instead of adding a new JOIN."* --- **Impact:** | Endpoint | Duplicate JOINs Eliminated | |----------|---------------------------| | `get_task_instances` | **2** (dag_run, dag_version) | | `get_mapped_task_instances` | **2** (dag_run, dag_version) | | `get_task_instances_batch` | **4** (triple-joined — had redundant post-paginate options) | **Backward Compatibility:** Both flags default to `False`, preserving existing behavior for callers that don't explicitly join (e.g., TaskInstanceHistory endpoints). --- ### 📝 Testing Added 2 new test cases in commit [`1b62ddf`](https://github.com/Arunodoy18/airflow/commit/1b62ddf86d): 1. **`test_relationship_fields_loaded_after_query_optimization`** — Verifies all relationship fields (`logical_date`, `run_after`, `dag_display_name`, `dag_version`, `note`) are still properly loaded in `get_task_instances` response 2. **`test_batch_relationship_fields_loaded_after_query_optimization`** — Verifies same for `get_task_instances_batch` including `rendered_fields` Both tests confirm that switching to `contains_eager` doesn't break response serialization. --- ### ⚠️ Note I'm currently on Windows and can't run Breeze locally. If you could run the benchmark script to verify the SQL reduction and performance improvement, that would be greatly appreciated! 🙏 The `--sql-only` flag provides an instant preview of the query structure difference without needing to seed data or run benchmarks: ```bash breeze shell python scripts/tools/benchmark_ti_query.py --sql-only ``` This will print both the BEFORE and AFTER SQL queries side-by-side for manual inspection. Alternatively, I'm happy to provide manual SQL captures from code inspection if that's more convenient. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
