Dandandan edited a comment on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-733824217


   @jorgecarleitao Not really about performance, as the current benchmarks / 
queries show; I am just looking at ways to improve the aggregate / join 
performance.
   
   The main thing I wanted to investigate is whether the aggregate / join 
itself can be made faster. I think one part would be to create a key that can 
be hashed faster: currently the hashing algorithm hashes each individual 
GroupByValue rather than working on a single byte array, and the latter could 
in principle be faster. Specialized code could also be written for hashing 
when grouping on only one column.
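   
   To make that concrete, here is a minimal sketch (not DataFusion's actual 
code; the function names and the row-major byte layout are assumptions for 
illustration) of hashing one contiguous byte key per row, plus a specialized 
single-column fast path:
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   // Concatenate the encoded group-by values of one row into a single
   // byte key and hash it once, instead of hashing each value separately.
   fn hash_row_key(encoded_values: &[&[u8]]) -> u64 {
       let mut key = Vec::with_capacity(encoded_values.iter().map(|v| v.len()).sum());
       for value in encoded_values {
           key.extend_from_slice(value); // fixed-width values appended back to back
       }
       let mut hasher = DefaultHasher::new();
       key.hash(&mut hasher); // one hash call over the whole byte array
       hasher.finish()
   }

   // Specialized fast path when grouping on a single primitive column:
   // hash the value directly, with no intermediate key at all.
   fn hash_single_i64(value: i64) -> u64 {
       let mut hasher = DefaultHasher::new();
       value.hash(&mut hasher);
       hasher.finish()
   }
   ```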
   
   It can have a larger impact on _**memory usage**_ though if you are hashing 
/ aggregating something with high cardinality, as each key will carry tens of 
extra bytes: 16 bytes for each GroupByValue, 8 bytes for using `Vec`, and 
8 bytes for boxing the inner Vec of the aggregation.
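   
   As a rough illustration of where the per-key bytes come from (the enum 
below is a hypothetical stand-in, not DataFusion's actual definition), a 
scalar enum of this shape occupies 16 bytes, and the `Vec` / `Box` 
indirections around it add their own pointer and allocation overhead for 
every distinct key:
   
   ```rust
   use std::mem::size_of;

   // Hypothetical stand-in for a per-group key value; the variants
   // here are assumptions for illustration only.
   enum GroupByValue {
       Int64(i64),
       UInt64(u64),
       Float64(f64),
   }

   fn main() {
       // 8-byte payload + discriminant, padded to 16 bytes on 64-bit targets.
       println!("GroupByValue: {} bytes", size_of::<GroupByValue>());
       // Each key is a Vec of these, so the Vec header and its heap
       // allocation are paid per distinct key as well.
       println!("Vec<GroupByValue> header: {} bytes", size_of::<Vec<GroupByValue>>());
   }
   ```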

