neilconway opened a new pull request, #21726:
URL: https://github.com/apache/datafusion/pull/21726
## Which issue does this PR close?
- Closes #21724.
## Rationale for this change
Some profiling suggested that `OptimizeProjections` was among the most
heavyweight of the logical optimizer passes for TPC-DS. This PR implements two
distinct optimizations:
1. In `RequiredIndices::add_expr`, the previous implementation created a
`HashSet` and walked the expression tree twice, adding referenced columns to the
`HashSet`. Finally, members of the `HashSet` were converted to indices. It is
faster to walk the expression tree once ourselves and convert column
references to indices directly. This saves the `HashSet` allocation and
insertions, plus one redundant tree walk.
2. In `optimize_projections`, we computed the minimal required set of
`GROUP BY` columns, based on functional dependencies. This is relatively
expensive, and when there are no functional dependencies (the common case) the
computation is always a no-op. Add a short-circuit to skip the redundant
computation in this scenario.
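For reviewers who want the shape of both changes without reading the diff, here is a rough, self-contained sketch. It is not the DataFusion code: `Expr`, `FuncDep`, `indices_via_hashset`, `indices_single_walk`, and `minimal_group_by` are hypothetical toy stand-ins, the schema is just a slice of column names, and the `HashSet` variant only approximates the older approach. The sort and dedup at the end of the single-walk variant are an assumption of the sketch, added so that both variants return comparable output.

```rust
use std::collections::HashSet;

// Toy expression type; a hypothetical stand-in for DataFusion's `Expr`.
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, Box<Expr>),
}

// Older shape (approximation): gather column names into a HashSet,
// then convert the names to schema indices in a second step.
fn indices_via_hashset(expr: &Expr, schema: &[&str]) -> Vec<usize> {
    fn collect<'a>(e: &'a Expr, out: &mut HashSet<&'a str>) {
        match e {
            Expr::Column(name) => {
                out.insert(name.as_str());
            }
            Expr::Literal(_) => {}
            Expr::Binary(l, r) => {
                collect(l, out);
                collect(r, out);
            }
        }
    }
    let mut names = HashSet::new();
    collect(expr, &mut names);
    let mut indices: Vec<usize> = names
        .iter()
        .filter_map(|n| schema.iter().position(|c| c == n))
        .collect();
    indices.sort_unstable();
    indices
}

// Newer shape: one walk that resolves each column reference to its
// index immediately, with no intermediate set.
fn add_expr(expr: &Expr, schema: &[&str], indices: &mut Vec<usize>) {
    match expr {
        Expr::Column(name) => {
            if let Some(i) = schema.iter().position(|c| *c == name.as_str()) {
                indices.push(i);
            }
        }
        Expr::Literal(_) => {}
        Expr::Binary(l, r) => {
            add_expr(l, schema, indices);
            add_expr(r, schema, indices);
        }
    }
}

fn indices_single_walk(expr: &Expr, schema: &[&str]) -> Vec<usize> {
    let mut indices = Vec::new();
    add_expr(expr, schema, &mut indices);
    // Sketch-only normalization: sort and drop duplicate indices.
    indices.sort_unstable();
    indices.dedup();
    indices
}

// Hypothetical functional dependency: `source` columns determine `target` columns.
struct FuncDep {
    source: Vec<usize>,
    target: Vec<usize>,
}

// Prune GROUP BY columns that are determined by other kept columns.
fn minimal_group_by(group_by: &[usize], deps: &[FuncDep]) -> Vec<usize> {
    // Short-circuit: with no dependencies, nothing can be pruned.
    if deps.is_empty() {
        return group_by.to_vec();
    }
    let mut kept = group_by.to_vec();
    for &col in group_by {
        // `col` is redundant if some dependency targets it and all of that
        // dependency's sources are still kept (and do not include `col`).
        let redundant = deps.iter().any(|d| {
            d.target.contains(&col)
                && d.source.iter().all(|s| *s != col && kept.contains(s))
        });
        if redundant {
            kept.retain(|c| *c != col);
        }
    }
    kept
}
```

In this toy model, the two index functions agree on any expression, and `minimal_group_by` returns its input unchanged whenever `deps` is empty, which is the case the short-circuit targets.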
Results on a newly added `optimize_projections` microbenchmark:
```
- tpch_q3: 14.6 µs → 11.9 µs (−18.5%)
- tpch_q5: 17.4 µs → 14.0 µs (−19.4%)
- clickbench_groupby: 10.3 µs → 6.8 µs (−34.1%)
- tpcds_subquery: 11.2 µs → 8.7 µs (−22.1%)
- small_schema: 1.87 µs → 1.68 µs (−10.3%)
```
## What changes are included in this PR?
* Add microbenchmark for `optimize_projections`
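* Implement the two optimizations described above: a single-walk `RequiredIndices::add_expr`, and a short-circuit that skips `GROUP BY` minimization when there are no functional dependencies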
## Are these changes tested?
Yes.
## Are there any user-facing changes?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]