kosiew opened a new pull request, #17875:
URL: https://github.com/apache/datafusion/pull/17875
## Which issue does this PR close?
* Closes #16684.
## Rationale for this change
The projection-pruning rule in the optimizer previously treated any
`SubqueryAlias` whose alias name did not exactly match the CTE name as an
"other subquery", and therefore aborted projection pushdown for that branch.
This incorrectly prevented projection pushdown when the recursive CTE
referenced itself with an alias (for example `FROM nodes AS child`).
Because the optimizer could not recognize that the aliased reference still
targeted the same CTE, it conservatively kept all columns on the table scan
that feeds the recursive branch. In practice this can cause unnecessary I/O
(for example with Parquet) because columns not required by the final output are
read.
This change allows the optimizer to detect aliased self-references inside a
recursive CTE and continue projection pushdown into the recursive term when
safe.
## What changes are included in this PR?
* Modify `plan_contains_other_subqueries` in
`datafusion/optimizer/src/optimize_projections/mod.rs` so that a
`SubqueryAlias` whose alias name differs from the CTE name is **not**
treated as an unrelated subquery outright. Instead, a helper checks whether
the aliased input ultimately targets the recursive CTE, and only if it does
not is the alias counted as another subquery.
* Add helper function `subquery_alias_targets_recursive_cte` to
`optimize_projections/mod.rs` which recursively walks a plan (through
`SubqueryAlias` and single-input operators) to determine whether the leaf
`TableScan` refers to the CTE name.
* Add an integration test `recursive_cte_with_aliased_self_reference` in
`datafusion/optimizer/tests/optimizer_integration.rs` which asserts that
projection pushdown occurs when a recursive CTE references itself with an
alias. The test checks that only the projected column (`id`) is kept in the
`TableScan` of the recursive branch.
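
To make the difference between the old and new checks concrete, here is a minimal sketch on a toy plan type. It is an illustration only: the `Plan` enum, `old_is_other_subquery`, `new_is_other_subquery`, `targets_recursive_cte`, and `aliased_self_reference` are hypothetical names invented for this example and are not DataFusion's `LogicalPlan` or the functions in this PR. The real `plan_contains_other_subqueries` also inspects more of the plan than this simplified version does.

```rust
// Toy stand-in for a logical plan. This is NOT DataFusion's `LogicalPlan`;
// it models only the node kinds needed to illustrate the check.
#[allow(dead_code)]
enum Plan {
    TableScan { table_name: String },
    SubqueryAlias { alias: String, input: Box<Plan> },
    // Stands in for any single-input operator.
    Filter { input: Box<Plan> },
    // Stands in for any multi-input operator.
    Join { left: Box<Plan>, right: Box<Plan> },
}

// Walk through aliases and single-input operators down to the leaf scan.
// Any multi-input node makes the walk give up and return `false`.
fn targets_recursive_cte(plan: &Plan, cte_name: &str) -> bool {
    match plan {
        Plan::TableScan { table_name } => table_name.as_str() == cte_name,
        Plan::SubqueryAlias { input, .. } | Plan::Filter { input } => {
            targets_recursive_cte(input, cte_name)
        }
        Plan::Join { .. } => false,
    }
}

// Old behaviour (simplified): any alias whose name differs from the CTE
// name counts as an unrelated subquery.
fn old_is_other_subquery(plan: &Plan, cte_name: &str) -> bool {
    match plan {
        Plan::SubqueryAlias { alias, .. } => alias.as_str() != cte_name,
        _ => false,
    }
}

// New behaviour (simplified): a differently named alias is unrelated only
// if its input does not resolve back to the CTE's own scan.
fn new_is_other_subquery(plan: &Plan, cte_name: &str) -> bool {
    match plan {
        Plan::SubqueryAlias { alias, input } => {
            alias.as_str() != cte_name && !targets_recursive_cte(input, cte_name)
        }
        _ => false,
    }
}

// Toy equivalent of `FROM nodes AS child` inside the recursive term.
fn aliased_self_reference() -> Plan {
    Plan::SubqueryAlias {
        alias: "child".to_string(),
        input: Box::new(Plan::TableScan {
            table_name: "nodes".to_string(),
        }),
    }
}
```

For the plan built by `aliased_self_reference()`, the old check reports an unrelated subquery for CTE name `"nodes"`, while the new check does not. An alias wrapped around a join is still reported as unrelated, because the walk stops at multi-input nodes.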
Files changed (summary):
* `datafusion/optimizer/src/optimize_projections/mod.rs`
  * Allow descending into aliased subqueries to see if they target the same
    recursive CTE.
  * Add `subquery_alias_targets_recursive_cte`.
* `datafusion/optimizer/tests/optimizer_integration.rs`
  * Add `recursive_cte_with_aliased_self_reference` test.
## Are these changes tested?
Yes. This PR adds an integration test
(`recursive_cte_with_aliased_self_reference`) that reproduces the problematic
scenario and validates the expected plan after optimization. The existing test
harness runs the new test as part of the optimizer integration suite.
## Are there any user-facing changes?
No changes to public APIs or SQL syntax. This is an internal optimizer
improvement which can reduce unnecessary I/O by enabling projection pushdown in
more recursive-CTE cases (when the recursive term uses an alias for the CTE).
There are no breaking changes.
## Additional notes / implementation details
* The heuristic used by `subquery_alias_targets_recursive_cte` is
intentionally conservative: it only walks through `SubqueryAlias` and operators
with a single input. If a plan node has multiple inputs (e.g. a join), the
helper returns `false` so that unrelated plans are not mistaken for ones
targeting the CTE.
* The change preserves safety by only allowing pushdown when we can be
confident the aliased subquery resolves back to the same CTE's table scan.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]