alamb opened a new pull request, #20749: URL: https://github.com/apache/datafusion/pull/20749
Draft until:
- [ ] New tests merge: https://github.com/apache/datafusion/pull/20723
- [ ] Run performance benchmarks

## Which issue does this PR close?

- Part of #18489
- Closes https://github.com/apache/datafusion/pull/20180
- Closes https://github.com/apache/datafusion/issues/15524
- Replaces https://github.com/apache/datafusion/pull/20665

## Rationale for this change

I [want DataFusion to be the fastest Parquet engine on ClickBench](https://github.com/apache/datafusion/issues/18489).

One of the queries where DataFusion is significantly slower is Query 29, which has a very strange pattern of many aggregate functions that are offset by a constant:

https://github.com/apache/datafusion/blob/0ca9d6586a43c323525b2e299448e0f1af4d6195/benchmarks/queries/clickbench/queries/q29.sql#L4

This is not a pattern I have ever seen in a real query, but it seems like the engine currently at the top of the ClickBench leaderboard has a special case for this pattern. ClickHouse probably does too. See:
- https://github.com/duckdb/duckdb/pull/15017
- Discussion on https://github.com/apache/datafusion/issues/15524

Thus I reluctantly conclude that we should have one too.

## What changes are included in this PR?

This is an alternative to my first attempt:
- https://github.com/apache/datafusion/pull/20665

In particular, since this is such a ClickBench-specific rule, I wanted to:
1. Minimize the downstream API / upgrade impact (aka not change existing APIs)
2. Optimize performance for the case where this rewrite will not apply (most times)

Changes:
1. Add a rewrite `SUM(expr + scalar)` --> `SUM(expr) + scalar*COUNT(expr)`
2. Tests for same

Note there are quite a few other ideas to potentially make this more general on https://github.com/apache/datafusion/issues/15524, but I am going with the simple thing of making it work for the use case we have in hand (ClickBench).

## Are these changes tested?

Yes, new tests are added.

## Are there any user-facing changes?

Faster performance
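
The rewrite `SUM(expr + scalar)` --> `SUM(expr) + scalar*COUNT(expr)` named in the change list can be checked by hand on a small nullable column. The following is a minimal standalone Rust sketch of that identity, not the optimizer rule from this PR. The helper names (`sum`, `count`, `sum_direct`, `sum_rewritten`) are hypothetical, and it assumes the usual SQL semantics where `SUM` skips NULLs and returns NULL when there are no non-NULL inputs, and `COUNT(expr)` counts only non-NULL values. Integer overflow is ignored here.

```rust
/// SUM over nullable values: NULLs are skipped, and the result is
/// None (NULL) when there is no non-NULL input.
fn sum(values: impl Iterator<Item = Option<i64>>) -> Option<i64> {
    values.flatten().fold(None, |acc, v| Some(acc.unwrap_or(0) + v))
}

/// COUNT(expr): the number of non-NULL values.
fn count(values: &[Option<i64>]) -> i64 {
    values.iter().filter(|v| v.is_some()).count() as i64
}

/// Direct evaluation of SUM(expr + scalar): add the scalar to every
/// non-NULL row, then sum.
fn sum_direct(values: &[Option<i64>], scalar: i64) -> Option<i64> {
    sum(values.iter().map(|v| v.map(|x| x + scalar)))
}

/// Rewritten form: SUM(expr) + scalar * COUNT(expr).
/// If SUM(expr) is NULL the whole expression stays NULL.
fn sum_rewritten(values: &[Option<i64>], scalar: i64) -> Option<i64> {
    sum(values.iter().copied()).map(|s| s + scalar * count(values))
}

fn main() {
    let col = [Some(10), None, Some(20), Some(30)];

    // Both forms agree for several constant offsets.
    for scalar in 0..5 {
        assert_eq!(sum_direct(&col, scalar), sum_rewritten(&col, scalar));
    }

    // An all-NULL column yields NULL in both forms.
    let all_null: [Option<i64>; 2] = [None, None];
    assert_eq!(sum_direct(&all_null, 3), None);
    assert_eq!(sum_rewritten(&all_null, 3), None);

    println!("{:?}", sum_rewritten(&col, 2));
}
```

One plausible reason this helps a query with many `SUM(expr + k)` terms over the same `expr` is that `SUM(expr)` and `COUNT(expr)` need to be computed only once and can then be combined with each constant using cheap scalar arithmetic.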
