mingmwang commented on PR #6003:
URL: https://github.com/apache/arrow-datafusion/pull/6003#issuecomment-1510678020
> Thank you for your ongoing efforts in improving the performance of aggregate functions. Really appreciate it!
>
> If I understand the core concept of this PR correctly: the current aggregate process, when dealing with high-cardinality aggregations, suffers from low efficiency due to the frequent generation of small arrays from slicing. To address this, you've changed the approach for high-cardinality aggregations by generating `ScalarValue` or `Vec<ScalarValue>` instead of creating arrays through slicing.
>
> Nonetheless, generating `ScalarValue`s could still lead to some overhead. As an alternative, I propose the following: we could add a set of methods to `arrow-arith`'s `aggregate` functions, allowing them to accept a selection vector (also suggested by @alamb previously in [#5944 (comment)](https://github.com/apache/arrow-datafusion/issues/5944#issuecomment-1502083042)). This way, we can avoid creating slices and generating `ScalarValue` at the same time.

Yes, you are right. I agree that generating `ScalarValue`s will involve some overhead. I'm not sure whether I can get rid of the `ScalarValue`s and instead produce Rust native types, using them to update the accumulator states directly; let me run some experiments on this.

Regarding the other approach you proposed, I'm open to it. However, since the arrow-rs aggregation framework does not currently provide such capabilities, I think we can try that approach once it is ready, compare the benchmark results, and then decide which way to go.
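To illustrate the selection-vector idea being discussed, here is a minimal, hypothetical sketch in plain Rust. The function name `sum_selected` and the use of `&[Option<i64>]` as a stand-in for an Arrow primitive array with a null bitmap are assumptions for illustration; arrow-rs does not currently expose such a kernel. The point is that one aggregate call can consume only the rows belonging to a group, with no sliced arrays and no intermediate `ScalarValue`s:

```rust
/// Hypothetical selection-vector aggregate: sum only the rows named by
/// `selection`, skipping nulls (`None`). A real kernel would operate on an
/// arrow `PrimitiveArray` plus its null buffer instead of `&[Option<i64>]`.
fn sum_selected(values: &[Option<i64>], selection: &[usize]) -> Option<i64> {
    let mut sum = 0i64;
    let mut seen_non_null = false;
    for &idx in selection {
        if let Some(v) = values[idx] {
            sum += v;
            seen_non_null = true;
        }
    }
    // SQL-style SUM: all-null (or empty) input yields NULL, i.e. None.
    seen_non_null.then_some(sum)
}

fn main() {
    // Rows 0, 2, and 3 belong to one group; row 1 belongs to another.
    let values = vec![Some(1), Some(10), None, Some(4)];
    assert_eq!(sum_selected(&values, &[0, 2, 3]), Some(5));
    assert_eq!(sum_selected(&values, &[1]), Some(10));
    assert_eq!(sum_selected(&values, &[2]), None);
    println!("ok");
}
```

Under this scheme the grouping step would only need to build one index list per group and pass each list to the kernel, rather than materializing a sliced array or a `Vec<ScalarValue>` per group.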
