alamb opened a new issue, #17899:
URL: https://github.com/apache/datafusion/issues/17899

   ### Is your feature request related to a problem or challenge?
   
   As reported in https://github.com/apache/datafusion/issues/16620 by 
@debajyoti-truefoundry, evaluating `DISTINCT ON` results in a query plan that 
uses `first_value` aggregates.
   
   The current implementation of `first_value` appears to have only a basic 
`Accumulator` implementation, and not the faster `GroupsAccumulator`: 
https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs#L93
   
   We can very likely improve the performance of such queries significantly by 
implementing a `GroupsAccumulator` (background in 
[Aggregating Millions of Groups Fast in Apache Arrow DataFusion](https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/)).
   
   
   
   ### Describe the solution you'd like
   
   1. Add a benchmark (maybe add a query to the 
[clickbench_extended](https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench) suite)
   2. Implement a `GroupsAccumulator` for `first_value` (and maybe `nth_value`)
   
   ### Describe alternatives you've considered
   
   I think the accumulator could be pretty straightforward: track which 
groups are new in each batch and copy the first row seen for each of them into 
the output (likely by using the `take` kernel).
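
   The idea above can be sketched roughly as follows. This is a simplified, hypothetical illustration and not the DataFusion `GroupsAccumulator` trait itself: `FirstValueGroups`, its fields, and the plain `&[i64]` / `&[usize]` inputs are assumptions made to keep the example self-contained. It also ignores nulls, filters, and `ORDER BY`, and uses an index gather in place of Arrow's `take`.

```rust
/// Hypothetical, simplified stand-in for a `GroupsAccumulator`-style
/// `first_value` over a single `i64` column. Not the real DataFusion trait.
#[derive(Debug, Default)]
struct FirstValueGroups {
    /// `firsts[g]` is `Some(v)` once group `g` has seen its first row.
    firsts: Vec<Option<i64>>,
}

impl FirstValueGroups {
    /// Process one batch. `group_indices[i]` is the group that `values[i]`
    /// belongs to; every index is assumed to be `< total_num_groups`.
    fn update_batch(
        &mut self,
        values: &[i64],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());
        if self.firsts.len() < total_num_groups {
            self.firsts.resize(total_num_groups, None);
        }

        // Pass 1: find the first row in this batch for each group that has
        // no stored value yet.
        let mut seen_in_batch = vec![false; total_num_groups];
        let mut new_groups = Vec::new();
        let mut take_rows = Vec::new();
        for (row, &group) in group_indices.iter().enumerate() {
            if self.firsts[group].is_none() && !seen_in_batch[group] {
                seen_in_batch[group] = true;
                new_groups.push(group);
                take_rows.push(row);
            }
        }

        // Pass 2: gather the selected rows (the role `take` would play on
        // Arrow arrays) and store them as the per-group first values.
        for (group, row) in new_groups.into_iter().zip(take_rows) {
            self.firsts[group] = Some(values[row]);
        }
    }

    /// Return one value per group, in group-index order.
    fn evaluate(&self) -> Vec<Option<i64>> {
        self.firsts.clone()
    }
}
```

   Groups that already hold a value are skipped in the first pass, so later batches only pay for the groups that are actually new.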
   
   ### Additional context
   
   There is a similar issue for optimizing `array_agg` here:
   - https://github.com/apache/datafusion/issues/10145
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

