Re: [I] Performance of `distinct on (columns)` [datafusion]

via GitHub Sat, 18 Oct 2025 04:53:45 -0700


alamb commented on issue #16620:
URL: https://github.com/apache/datafusion/issues/16620#issuecomment-3365476837


   The obvious difference between the plans, from what I can see, is the 
inclusion of `first_value` in the `DISTINCT ON` query.
   
   So I think the short answer is "it is likely we could improve the 
performance of `DISTINCT ON (columns)` by improving the performance of the 
`first_value` aggregate".
   
   I took a brief look at the implementation: 
https://github.com/apache/datafusion/blob/main/datafusion/functions-aggregate/src/nth_value.rs#L93
   
   And it does not seem to have a `GroupsAccumulator` (background in 
[Aggregating Millions of Groups Fast in Apache Arrow 
DataFusion](https://arrow.apache.org/blog/2023/08/05/datafusion_fast_grouping/)).
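
   To illustrate why that matters, here is a minimal, std-only sketch of the 
two state layouts. It is not DataFusion code: the `PerGroupAccumulator` trait, 
the `FirstValue` and `GroupedFirstValue` types, and the plain `i64` / `u32` 
columns are all hypothetical stand-ins, loosely modeled on the idea described 
in the blog post. The first function keeps one boxed accumulator per group; 
the second keeps the state for all groups in one flat vector indexed by a 
dense group index.

```rust
use std::collections::HashMap;

/// Hypothetical per-group interface (NOT DataFusion's actual trait):
/// one accumulator instance exists for every distinct group.
trait PerGroupAccumulator {
    fn update(&mut self, value: i64);
    fn evaluate(&self) -> Option<i64>;
}

#[derive(Default)]
struct FirstValue {
    first: Option<i64>,
}

impl PerGroupAccumulator for FirstValue {
    fn update(&mut self, value: i64) {
        // Keep only the first value seen for this group.
        if self.first.is_none() {
            self.first = Some(value);
        }
    }

    fn evaluate(&self) -> Option<i64> {
        self.first
    }
}

/// Per-group layout: every row pays for a hash lookup plus a dynamic call,
/// and every new group pays for a separate heap allocation.
fn first_value_per_group(keys: &[u32], values: &[i64]) -> HashMap<u32, Option<i64>> {
    let mut accs: HashMap<u32, Box<dyn PerGroupAccumulator>> = HashMap::new();
    for (k, v) in keys.iter().zip(values) {
        accs.entry(*k)
            .or_insert_with(|| Box::new(FirstValue::default()) as Box<dyn PerGroupAccumulator>)
            .update(*v);
    }
    accs.into_iter().map(|(k, acc)| (k, acc.evaluate())).collect()
}

/// Grouped layout (hypothetical): a single accumulator owns the state of
/// all groups, stored contiguously and indexed by a dense group index.
#[derive(Default)]
struct GroupedFirstValue {
    firsts: Vec<Option<i64>>,
}

impl GroupedFirstValue {
    /// `group_indices[i]` is the dense group index assigned to `values[i]`.
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize) {
        // Grow the state vector if this batch introduced new groups.
        if self.firsts.len() < total_num_groups {
            self.firsts.resize(total_num_groups, None);
        }
        for (&g, &v) in group_indices.iter().zip(values) {
            if self.firsts[g].is_none() {
                self.firsts[g] = Some(v);
            }
        }
    }

    fn evaluate(self) -> Vec<Option<i64>> {
        self.firsts
    }
}
```

   Both variants compute the same first value per group. The grouped variant 
avoids the per-group allocation and the per-row dynamic dispatch, which is the 
kind of saving a `GroupsAccumulator` is meant to provide when there are many 
groups.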
   
   There is a similar issue for optimizing `array_agg` here:
   - https://github.com/apache/datafusion/issues/10145
   
   I will file a corresponding ticket to implement a `GroupsAccumulator` for 
`first_value` / `nth_value`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

