alamb opened a new pull request, #20665: URL: https://github.com/apache/datafusion/pull/20665
Draft until:
- [ ] New PR for new ClickBench explain tests
- [ ] Break out comment improvements
- [ ] Run performance benchmarks

## Which issue does this PR close?

- Part of https://github.com/apache/datafusion/issues/15524
- Closes https://github.com/apache/datafusion/pull/20180
- Closes https://github.com/apache/datafusion/issues/15524

## Rationale for this change

I [want DataFusion to be the fastest Parquet engine on ClickBench](https://github.com/apache/datafusion/issues/15524).

One of the queries where DataFusion is significantly slower is Query 29, which has a very strange pattern:

https://github.com/apache/datafusion/blob/0ca9d6586a43c323525b2e299448e0f1af4d6195/benchmarks/queries/clickbench/queries/q29.sql#L4

This is not a pattern I have ever seen in a real query, but it seems the engine currently at the top of the ClickBench leaderboard has a special case for it. See
- https://github.com/duckdb/duckdb/pull/15017
- Discussion on https://github.com/apache/datafusion/issues/15524

Thus I reluctantly conclude that we should have one too.

## What changes are included in this PR?

1. Add a rewrite `SUM(expr + scalar)` --> `SUM(expr) + scalar*COUNT(expr)`
2. Tests for same

This is implemented as an `AggregateUDF::simplify` rule, as discussed on https://github.com/apache/datafusion/pull/20180#issuecomment-3881843201 and suggested by @UBarney

Note there are quite a few other ideas for making this more general on https://github.com/apache/datafusion/issues/15524, but I am going with the simple approach of making it work for the use case we have in hand (ClickBench).

## Are these changes tested?

Yes, new tests are added

## Are there any user-facing changes?

Faster performance
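To illustrate why the rewrite `SUM(expr + scalar)` --> `SUM(expr) + scalar*COUNT(expr)` gives the same answer, here is a minimal standalone sketch. It is not the code in this PR and does not use any DataFusion API: the helpers `sql_sum`, `sql_count`, `sum_direct`, and `sum_rewritten` are hypothetical names made up for the example, and it models a nullable integer column as `Option<i64>`. It assumes the usual SQL semantics (`SUM` and `COUNT(expr)` skip NULLs, and `SUM` over no non-NULL rows is NULL) and ignores overflow and floating-point rounding, which a real rule would need to consider.

```rust
/// SQL-style SUM over nullable values: NULLs are skipped, and the result is
/// NULL (`None`) when there is no non-NULL input. (Hypothetical helper.)
fn sql_sum(values: impl Iterator<Item = Option<i64>>) -> Option<i64> {
    values
        .flatten()
        .fold(None::<i64>, |acc, v| Some(acc.unwrap_or(0) + v))
}

/// SQL-style COUNT(expr): counts only the non-NULL values. (Hypothetical helper.)
fn sql_count(values: &[Option<i64>]) -> i64 {
    values.iter().filter(|v| v.is_some()).count() as i64
}

/// SUM(expr + scalar), evaluated row by row. NULL + scalar stays NULL.
fn sum_direct(values: &[Option<i64>], scalar: i64) -> Option<i64> {
    sql_sum(values.iter().copied().map(|v| v.map(|x| x + scalar)))
}

/// SUM(expr) + scalar * COUNT(expr), the rewritten form.
/// If SUM(expr) is NULL, the whole expression is NULL.
fn sum_rewritten(values: &[Option<i64>], scalar: i64) -> Option<i64> {
    sql_sum(values.iter().copied()).map(|s| s + scalar * sql_count(values))
}

fn main() {
    let column = [Some(10), None, Some(20), Some(30)];
    let direct = sum_direct(&column, 2);
    let rewritten = sum_rewritten(&column, 2);
    println!("direct = {:?}, rewritten = {:?}", direct, rewritten);
}
```

The point of the rewritten form is that the addition happens once on the aggregated result rather than once per row, and when a query has several such sums over the same `expr`, they can in principle share one `SUM(expr)` and one `COUNT(expr)`.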
