nathanb9 opened a new pull request, #22551: URL: https://github.com/apache/datafusion/pull/22551
## Summary

Adds support for materializing Common Table Expressions (CTEs) that are referenced more than once. When enabled, multi-referenced CTEs ending in expensive operations (Aggregate, Distinct, Window, Union) are computed once and cached in memory for reuse.

- Implements DuckDB-inspired heuristic: only materialize CTEs ending in expensive operations
- Uses Extension nodes to avoid modifying core LogicalPlan enum
- Handles nested CTE dependencies with correct execution ordering
- Gated behind `enable_materialized_ctes` config (default: true)
- Respects explicit `MATERIALIZED` / `NOT MATERIALIZED` SQL hints (PostgreSQL dialect)

## Benchmark Results (TPC-DS SF1, 10 iterations)

| Query | Baseline | Materialized | Speedup |
|-------|----------|--------------|---------|
| Q47 | 401ms | 141ms | **2.85x** |
| Q57 | 112ms | 42ms | **2.67x** |
| Q2 | 101ms | 64ms | **1.58x** |
| Q74 | 311ms | 164ms | **1.90x** |
| Q75 | 192ms | 164ms | **1.17x** |

Known limitation: CTEs where the outer query filters on different grouping key values per reference (e.g., TPC-DS Q39) may regress. Users can opt out with `NOT MATERIALIZED`.

## Test plan

- [x] Unit tests for materialization logic (7 tests in sql_integration)
- [x] All existing CTE tests pass (recursive CTEs unaffected)
- [x] TPC-DS SF1 full suite (98/99 queries pass, Q30 has pre-existing schema error)
- [x] Verified no regressions on Q64 (dependency ordering)

Closes https://github.com/apache/datafusion/issues/17737
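For readers who want a concrete picture of the decision rule described in the summary, here is a minimal standalone sketch. It is not the code from this PR and does not use DataFusion's types: `TopOp`, `Hint`, `CteInfo`, and `should_materialize` are hypothetical names invented for illustration. It also assumes that the config gate is checked before any SQL hint and that an explicit `MATERIALIZED` hint applies even to a CTE referenced only once; the PR description does not state either point.

```rust
/// Hypothetical, simplified stand-in for the final operator of a CTE's plan.
/// The real PR works on DataFusion logical plans; this enum is illustrative only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TopOp {
    Aggregate,
    Distinct,
    Window,
    Union,
    Projection,
    Filter,
    TableScan,
}

/// Hypothetical representation of the SQL hint attached to a CTE.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Hint {
    None,
    Materialized,
    NotMaterialized,
}

/// Hypothetical summary of what the decision needs to know about one CTE.
struct CteInfo {
    top_op: TopOp,
    ref_count: usize,
    hint: Hint,
}

/// Decide whether a CTE should be computed once and cached.
/// `enabled` plays the role of the `enable_materialized_ctes` config flag.
fn should_materialize(enabled: bool, cte: &CteInfo) -> bool {
    // Assumption: the config gate is checked first and disables everything.
    if !enabled {
        return false;
    }
    match cte.hint {
        // Assumption: an explicit hint overrides the heuristic in both directions.
        Hint::Materialized => true,
        Hint::NotMaterialized => false,
        // No hint: materialize only CTEs that are referenced more than once
        // and end in one of the operations the summary calls expensive.
        Hint::None => {
            cte.ref_count > 1
                && matches!(
                    cte.top_op,
                    TopOp::Aggregate | TopOp::Distinct | TopOp::Window | TopOp::Union
                )
        }
    }
}
```

Under these assumptions, an aggregate CTE referenced twice is materialized, the same CTE referenced once is not, and `NOT MATERIALIZED` always opts out, which matches the workaround suggested for queries like TPC-DS Q39.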
