Dandandan edited a comment on pull request #8765:
URL: https://github.com/apache/arrow/pull/8765#issuecomment-733824217


   @jorgecarleitao Not really about performance, as the current benchmarks / 
queries show; I am just looking at ways to improve the aggregate / join 
performance.
   
   The main thing I wanted to investigate is whether the aggregate / join 
itself can be made faster. I think one part would be to create a key that can 
be hashed faster: currently the hashing algorithm hashes each individual 
GroupByValue rather than working on a single byte array, and the latter could 
in principle be faster. Specialized code could also be written for hashing 
when grouping on only one column.
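   
   To make that concrete, here is a minimal sketch (not DataFusion's actual 
code; the function names and the row-major byte layout are assumptions for 
illustration) of hashing one contiguous byte key per row, plus a specialized 
single-column fast path:
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};

   // Concatenate the encoded group-by values of one row into a single
   // byte key and hash it once, instead of hashing each value separately.
   fn hash_row_key(encoded_values: &[&[u8]]) -> u64 {
       let mut key = Vec::with_capacity(encoded_values.iter().map(|v| v.len()).sum());
       for value in encoded_values {
           key.extend_from_slice(value); // fixed-width values appended back to back
       }
       let mut hasher = DefaultHasher::new();
       key.hash(&mut hasher); // one hash call over the whole byte array
       hasher.finish()
   }

   // Specialized fast path when grouping on a single primitive column:
   // hash the value directly, with no intermediate key at all.
   fn hash_single_i64(value: i64) -> u64 {
       let mut hasher = DefaultHasher::new();
       value.hash(&mut hasher);
       hasher.finish()
   }
   ```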
   
   It can have a larger impact on _**memory usage**_ though if you are hashing 
/ aggregating something with high cardinality, as each key will carry tens of 
extra bytes: 16 bytes for each GroupByValue, 8 bytes for using `Vec`, and 
8 bytes for boxing the inner Vec of the aggregation.
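   
   As a rough illustration of where the per-key bytes come from (the enum 
below is a hypothetical stand-in, not DataFusion's actual definition), a 
scalar enum of this shape occupies 16 bytes, and the `Vec` / `Box` 
indirections around it add their own pointer and allocation overhead for 
every distinct key:
   
   ```rust
   use std::mem::size_of;

   // Hypothetical stand-in for a per-group key value; the variants
   // here are assumptions for illustration only.
   enum GroupByValue {
       Int64(i64),
       UInt64(u64),
       Float64(f64),
   }

   fn main() {
       // 8-byte payload + discriminant, padded to 16 bytes on 64-bit targets.
       println!("GroupByValue: {} bytes", size_of::<GroupByValue>());
       // Each key is a Vec of these, so the Vec header and its heap
       // allocation are paid per distinct key as well.
       println!("Vec<GroupByValue> header: {} bytes", size_of::<Vec<GroupByValue>>());
   }
   ```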

