alamb commented on issue #16620: URL: https://github.com/apache/datafusion/issues/16620#issuecomment-3365476837
The obvious difference between the plans, from what I can see, is the inclusion of `first_value` in the `DISTINCT ON` query.

So I think the short answer is "it is likely we could improve the performance of `DISTINCT ON` by improving the performance of the `first_value` aggregate".

I took a brief look at the implementation: https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs#L93

It does not seem to have a `GroupsAccumulator` (background in [Aggregating Millions of Groups Fast in Apache Arrow DataFusion](https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/)).

There is a similar issue for optimizing `array_agg` here:
- https://github.com/apache/datafusion/issues/10145

I will file a corresponding ticket for implementing `GroupsAccumulator` for `first_value` / `nth_value`.
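To illustrate why a `GroupsAccumulator` could help here, below is a minimal, self-contained sketch that contrasts one state object per group with a single accumulator holding flat state for all groups. It is not DataFusion's code: the type and function names (`FirstValueAccumulator`, `FirstValueGroupsAccumulator`, `first_value_per_group`) are hypothetical, the `update_batch` shape is only loosely modeled on the real trait, and it assumes `i64` values with no nulls, no filter, and no `ORDER BY` handling.

```rust
/// Hypothetical stand-in for a per-group accumulator:
/// one separately allocated state object per group.
#[derive(Default)]
struct FirstValueAccumulator {
    first: Option<i64>,
}

impl FirstValueAccumulator {
    /// Keep only the first value ever seen for this group.
    fn update(&mut self, value: i64) {
        if self.first.is_none() {
            self.first = Some(value);
        }
    }
}

/// Per-group path: every group owns its own boxed accumulator,
/// so each row touches a different heap allocation.
fn first_value_per_group(
    group_indices: &[usize],
    values: &[i64],
    num_groups: usize,
) -> Vec<Option<i64>> {
    let mut accs: Vec<Box<FirstValueAccumulator>> =
        (0..num_groups).map(|_| Box::default()).collect();
    for (&group, &value) in group_indices.iter().zip(values) {
        accs[group].update(value);
    }
    accs.iter().map(|acc| acc.first).collect()
}

/// Hypothetical groups-style accumulator: a single object whose state
/// for all groups lives in one contiguous vector indexed by group id.
struct FirstValueGroupsAccumulator {
    firsts: Vec<Option<i64>>,
}

impl FirstValueGroupsAccumulator {
    fn new() -> Self {
        Self { firsts: Vec::new() }
    }

    /// Process a whole batch at once. `group_indices[i]` is the group
    /// that `values[i]` belongs to (an assumption of this sketch).
    fn update_batch(
        &mut self,
        values: &[i64],
        group_indices: &[usize],
        total_num_groups: usize,
    ) {
        // Grow the flat state when new groups appear; never shrink it.
        if self.firsts.len() < total_num_groups {
            self.firsts.resize(total_num_groups, None);
        }
        for (&group, &value) in group_indices.iter().zip(values) {
            let slot = &mut self.firsts[group];
            if slot.is_none() {
                *slot = Some(value);
            }
        }
    }

    /// Return one result per group, in group-index order.
    fn evaluate(self) -> Vec<Option<i64>> {
        self.firsts
    }
}
```

Both paths produce the same per-group results; the difference is that the groups-style version avoids a separate allocation per group and keeps the state contiguous, which is the kind of saving the linked blog post describes for queries with many groups.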
