alamb opened a new pull request, #20665:
URL: https://github.com/apache/datafusion/pull/20665

   Draft until:
   - [ ] New PR for new clickbench explain tests
   - [ ] Break out comment improvements
   - [ ] Run performance benchmarks
   
   
   ## Which issue does this PR close?
   
   - Part of https://github.com/apache/datafusion/issues/15524
   - Closes https://github.com/apache/datafusion/pull/20180
   - Closes https://github.com/apache/datafusion/issues/15524
   
   ## Rationale for this change
   
   I [want DataFusion to be the fastest Parquet engine on 
ClickBench](https://github.com/apache/datafusion/issues/15524). One of the 
queries where DataFusion is significantly slower is Query 29 which has a very 
strange pattern:
   
   
https://github.com/apache/datafusion/blob/0ca9d6586a43c323525b2e299448e0f1af4d6195/benchmarks/queries/clickbench/queries/q29.sql#L4
   
   This is not a pattern I have ever seen in a real query, but it seems like 
the engine currently at the top of the ClickBench leaderboard has a special 
case. See
   - https://github.com/duckdb/duckdb/pull/15017
   - Discussion on https://github.com/apache/datafusion/issues/15524
   
   Thus I reluctantly conclude that we should have one too. 
   
   ## What changes are included in this PR?
   
   1. Add a rewrite `SUM(expr + scalar)` --> `SUM(expr) + scalar*COUNT(expr)` 
   2. Tests for same
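
The identity behind the rewrite can be checked with a small standalone sketch. This is not the code in this PR: the function names `sum_of_shifted` and `rewritten_sum` are hypothetical, `Option<i64>` stands in for a nullable integer column, and overflow and type coercion (which a real rule would need to consider) are ignored.

```rust
/// Baseline: SUM(expr + scalar). Like SQL SUM, NULL inputs are skipped and
/// the result is NULL (None) when there are no non-NULL inputs.
fn sum_of_shifted(values: &[Option<i64>], scalar: i64) -> Option<i64> {
    let non_null: Vec<i64> = values.iter().flatten().copied().collect();
    if non_null.is_empty() {
        return None;
    }
    Some(non_null.iter().map(|v| v + scalar).sum())
}

/// Rewritten form: SUM(expr) + scalar * COUNT(expr).
/// COUNT(expr) counts only non-NULL inputs, and NULL + anything is NULL,
/// so the all-NULL case still yields NULL.
fn rewritten_sum(values: &[Option<i64>], scalar: i64) -> Option<i64> {
    let non_null: Vec<i64> = values.iter().flatten().copied().collect();
    // SUM(expr): NULL when there are no non-NULL inputs.
    let sum: Option<i64> = if non_null.is_empty() {
        None
    } else {
        Some(non_null.iter().sum())
    };
    // COUNT(expr): number of non-NULL inputs, never NULL.
    let count = non_null.len() as i64;
    sum.map(|s| s + scalar * count)
}
```

For example, with the inputs `[Some(1), None, Some(5), Some(10)]` and a scalar of `2`, both functions return `Some(22)`: the baseline adds `3 + 7 + 12`, while the rewritten form computes `16 + 2 * 3`. The benefit of the rewritten form is that several `SUM(expr + scalar)` aggregates over the same `expr` can share a single `SUM(expr)` and `COUNT(expr)`.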
   
   This is implemented as an `AggregateUDF::simplify` rule as discussed on 
https://github.com/apache/datafusion/pull/20180#issuecomment-3881843201 and 
suggested by @UBarney 
   
   Note there are quite a few other ideas to potentially make this more general 
on https://github.com/apache/datafusion/issues/15524 but I am going with the 
simple thing of making it work for the use case we have in hand (ClickBench).
   
   ## Are these changes tested?
   
   Yes, new tests are added
   
   ## Are there any user-facing changes?
   
   Faster performance
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
