[GitHub] [arrow-datafusion] Dandandan commented on issue #839: Refactor the hash_aggregate

GitBox Mon, 09 Aug 2021 00:43:41 -0700


Dandandan commented on issue #839:
URL: 
https://github.com/apache/arrow-datafusion/issues/839#issuecomment-895017194



   > It's done to the entire record batches. Considering the batches are 1024 
rows with 3 columns, if the key number is 256, then there will be 256 small 
batches after the take, each batch is about 4 rows. `Simd` will make no sense 
in this situation.
   
   Yes, but not in the code you linked. There, the `take` is done once for the 
arrays in `aggr_input_values`, regardless of how many distinct keys the batch 
contains. The take only rearranges the data into the same order as the keys, 
and it does so only once per input array of the aggregate expression values 
(not even for the entire batch).
   
   In other places, yes: operations like `Array::slice` are done for each key 
that occurs, which becomes inefficient when the batch contains many small 
groups, and this also shows up in profiles.
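
To illustrate the difference, here is a minimal sketch in plain Rust using only the standard library. It is not the DataFusion code and does not use the Arrow API: `take` below is a hypothetical stand-in for the Arrow `take` kernel, and `sum_by_key` is an invented name. It reorders a value column with a single gather, however many distinct keys there are, and then aggregates over the contiguous run of each key.

```rust
/// Hypothetical stand-in for Arrow's `take` kernel:
/// gathers `values` at the given row `indices`.
fn take(values: &[i64], indices: &[usize]) -> Vec<i64> {
    indices.iter().map(|&i| values[i]).collect()
}

/// Sums `values` per group key, gathering the column only once.
/// Assumes `keys` and `values` have the same length.
fn sum_by_key(keys: &[u32], values: &[i64]) -> Vec<(u32, i64)> {
    // Row indices ordered so that rows with equal keys are adjacent.
    let mut indices: Vec<usize> = (0..keys.len()).collect();
    indices.sort_by_key(|&i| keys[i]);

    // One gather for the whole column, independent of the number of keys.
    let reordered = take(values, &indices);

    // Walk the contiguous runs of equal keys and aggregate each run.
    let mut out: Vec<(u32, i64)> = Vec::new();
    let mut start = 0;
    while start < indices.len() {
        let key = keys[indices[start]];
        let mut end = start;
        while end < indices.len() && keys[indices[end]] == key {
            end += 1;
        }
        out.push((key, reordered[start..end].iter().sum::<i64>()));
        start = end;
    }
    out
}
```

In this sketch the per-key work is only a borrowed subslice of the already reordered data. The inefficient pattern described above corresponds to building a separate small array for every key instead.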


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
