alamb opened a new issue, #17899: URL: https://github.com/apache/datafusion/issues/17899
### Is your feature request related to a problem or challenge?

As reported in https://github.com/apache/datafusion/issues/16620 by @debajyoti-truefoundry, evaluating `DISTINCT ON` results in a query plan that uses `first_value` aggregates.

The current implementation of `first_value` appears to have only a basic `Accumulator` implementation, and not the faster `GroupsAccumulator`: https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs#L93

We can very likely improve the performance of such queries significantly by implementing a `GroupsAccumulator` (background in [Aggregating Millions of Groups Fast in Apache Arrow DataFusion](https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/)).

### Describe the solution you'd like

1. Add a benchmark (maybe add a query to the [clickbench_extended](https://github.com/apache/datafusion/tree/main/benchmarks/queries/clickbench) suite)
2. Implement a `GroupsAccumulator` for first (and maybe nth) value

### Describe alternatives you've considered

I think the accumulator could be pretty straightforward: track whichever groups are new in each batch and copy the first row seen for each of them into the output (likely by using the `take` kernel).

### Additional context

There is a similar issue for optimizing `array_agg` here:
- https://github.com/apache/datafusion/issues/10145
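For illustration, the "track new groups and copy the first row seen" idea could be sketched roughly as below. This is a dependency-free sketch, not the DataFusion `GroupsAccumulator` trait: the `FirstValueGroups` type and its methods are hypothetical, plain `Vec`s stand in for Arrow arrays, values are assumed to be `i64`, and rows are assumed to arrive already in the desired order. The argument names `group_indices` and `total_num_groups` are chosen to resemble what a groups accumulator receives, but the signature here is an assumption.

```rust
/// Hypothetical, simplified stand-in for a `first_value` groups accumulator.
/// It is NOT the DataFusion `GroupsAccumulator` trait.
struct FirstValueGroups {
    /// First value captured per group (`None` if that first value was NULL).
    firsts: Vec<Option<i64>>,
    /// Whether a group has already captured its first row.
    seen: Vec<bool>,
}

impl FirstValueGroups {
    fn new() -> Self {
        Self { firsts: Vec::new(), seen: Vec::new() }
    }

    /// `values[i]` belongs to group `group_indices[i]`.
    fn update_batch(
        &mut self,
        values: &[Option<i64>],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        assert_eq!(values.len(), group_indices.len());
        // Grow (never shrink) the per-group state.
        if total_num_groups > self.seen.len() {
            self.firsts.resize(total_num_groups, None);
            self.seen.resize(total_num_groups, false);
        }

        // Step 1: find the row index of the first row for each new group.
        let mut new_rows: Vec<(usize, usize)> = Vec::new();
        for (row, &group) in group_indices.iter().enumerate() {
            if !self.seen[group] {
                self.seen[group] = true;
                new_rows.push((group, row));
            }
        }

        // Step 2: copy only those rows into the state. With Arrow arrays this
        // gather step is where a `take`-style operation might be used.
        for (group, row) in new_rows {
            self.firsts[group] = values[row];
        }
    }

    /// Return one output value per group.
    fn evaluate(self) -> Vec<Option<i64>> {
        self.firsts
    }
}

fn main() {
    let mut acc = FirstValueGroups::new();
    // Batch 1: groups 0 and 1 are new; the second row for group 0 is ignored.
    acc.update_batch(&[Some(10), Some(20), Some(30)], &[0, 1, 0], 2);
    // Batch 2: group 1 already has a value; group 2 is new and starts with NULL.
    acc.update_batch(&[Some(40), None, Some(50)], &[1, 2, 2], 3);
    println!("{:?}", acc.evaluate());
}
```

Groups that already hold a value skip the copy entirely, so per-batch work beyond the scan is proportional to the number of new groups. A real implementation would also need to handle ordering requirements, arbitrary data types, and merging of partial state, none of which this sketch attempts.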
