mingmwang commented on PR #6003:
URL: 
https://github.com/apache/arrow-datafusion/pull/6003#issuecomment-1510678020

   > Thank you for your ongoing efforts in improving the performance of 
aggregate functions. Really appreciate it!
   > 
   > If I understand the core concept of this PR correctly: the current 
aggregate process, when dealing with high-cardinality aggregations, suffers 
from low efficiency due to the frequent generation of small arrays from 
slicing. To address this, you've changed the approach for high-cardinality 
aggregations by generating `ScalarValue` or `Vec<ScalarValue>` instead of 
creating arrays through slicing.
   > 
   > Nonetheless, generating `ScalarValue`s could still lead to some overhead. 
As an alternative, I propose the following: we could add a set of methods to 
`arrow-arith`'s `aggregate` functions, allowing them to accept a selection 
vector (also suggested by @alamb previously in [#5944 
(comment)](https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1502083042)).
 This way, we can avoid creating slices and generating `ScalarValue` at the 
same time.
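
The selection-vector idea quoted above could look roughly like the following. This is a minimal sketch in plain Rust, using `Vec<Option<i64>>` as a stand-in for an arrow-rs nullable primitive array; `selective_sum` and its signature are hypothetical illustrations, not an existing `arrow-arith` API:

```rust
// Hypothetical selection-vector aggregation: sum only the rows whose
// indices appear in `selection`, without materializing a sliced array.
// `values` stands in for a nullable arrow PrimitiveArray.
fn selective_sum(values: &[Option<i64>], selection: &[usize]) -> Option<i64> {
    let mut sum = None;
    for &idx in selection {
        if let Some(v) = values[idx] {
            sum = Some(sum.unwrap_or(0) + v);
        }
    }
    sum
}

fn main() {
    let values = vec![Some(1), None, Some(3), Some(4)];
    // Row indices belonging to one group, picked without slicing the array.
    let selection = vec![0, 2, 3];
    assert_eq!(selective_sum(&values, &selection), Some(8));
}
```

The point of the sketch is that each group's rows are visited via an index list, so neither per-group array slices nor per-row `ScalarValue`s are created.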
   
   Yes, you are right. I agree that generating `ScalarValue`s will involve some 
overhead. I'm not sure yet whether I can get rid of the `ScalarValue`s entirely 
and instead produce Rust native types, using them to update the accumulator 
states directly; let me run some experiments on this. Regarding the other 
approach you proposed, I'm open to it, but since the arrow-rs aggregation 
framework does not currently provide such capabilities, I think that once it is 
ready we can try that approach as well, compare the benchmark results, and 
then decide which way to go.
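
The native-type experiment mentioned above could be sketched like this. `MiniScalar` is an illustrative stand-in for DataFusion's `ScalarValue`, and `SumAccumulator` is a simplified accumulator, not the PR's actual implementation; the sketch only shows the contrast between folding native values into the state directly and wrapping each row in an enum first:

```rust
// Illustrative stand-in for DataFusion's ScalarValue enum.
#[derive(Debug, Clone)]
enum MiniScalar {
    Int64(Option<i64>),
}

// Accumulator whose state is a plain Rust native type.
struct SumAccumulator {
    sum: Option<i64>,
}

impl SumAccumulator {
    // Hot path: update the state from a native value directly,
    // with no enum construction or matching per row.
    fn update_native(&mut self, v: i64) {
        self.sum = Some(self.sum.unwrap_or(0) + v);
    }

    // Enum-based path: every input row is wrapped and matched first.
    fn update_scalar(&mut self, s: &MiniScalar) {
        match s {
            MiniScalar::Int64(Some(v)) => self.update_native(*v),
            MiniScalar::Int64(None) => {}
        }
    }
}

fn main() {
    let mut acc = SumAccumulator { sum: None };
    for v in [1i64, 2, 3] {
        acc.update_native(v);
    }
    acc.update_scalar(&MiniScalar::Int64(Some(4)));
    assert_eq!(acc.sum, Some(10));
}
```

Whether skipping the enum wrapping measurably helps would, as the comment says, need to be settled by benchmarks.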


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]