[GitHub] [arrow-datafusion] alamb opened a new issue #142: Implement vectorized hashing

GitBox Mon, 26 Apr 2021 06:26:10 -0700


alamb opened a new issue #142:
URL: https://github.com/apache/arrow-datafusion/issues/142



   *Note*: migrated from original JIRA: 
https://issues.apache.org/jira/browse/ARROW-11112
   
   Currently, the join and hash aggregate operators create a key for each row 
individually from its column values. This is far from ideal: it does not take 
advantage of the cache-friendly, vectorized nature of Arrow, but instead copies 
data into a `Vec`, traverses multiple arrays in the inner loop, etc.
   
   The following blog post summarizes an approach to doing this in a vectorized way:
   
   [https://www.cockroachlabs.com/blog/vectorized-hash-joiner/]
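
To make the column-at-a-time idea concrete, here is a minimal sketch, not the 
DataFusion implementation. It uses plain slices to stand in for Arrow arrays, 
and `hash_column` and `combine` are hypothetical names chosen for illustration. 
Each call walks one column in a tight loop and folds it into a shared per-row 
hash buffer, so no per-row key is ever materialized.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical combiner: mixes the hash of one column value into the
/// running hash of the row. The multiplier is an arbitrary choice.
fn combine(seed: u64, value_hash: u64) -> u64 {
    seed.wrapping_mul(31).wrapping_add(value_hash)
}

/// Hypothetical helper: folds one whole column into the per-row hash buffer.
/// The inner loop touches a single column, which keeps access sequential.
fn hash_column<T: Hash>(column: &[T], hashes: &mut [u64]) {
    assert_eq!(column.len(), hashes.len());
    for (row_hash, value) in hashes.iter_mut().zip(column) {
        let mut hasher = DefaultHasher::new();
        value.hash(&mut hasher);
        *row_hash = combine(*row_hash, hasher.finish());
    }
}

fn main() {
    // Two key columns of a four-row batch (slices stand in for Arrow arrays).
    let ids: Vec<i64> = vec![1, 2, 1, 3];
    let names: Vec<&str> = vec!["a", "b", "a", "a"];

    // One hash per row, updated one column at a time.
    let mut hashes = vec![0u64; ids.len()];
    hash_column(&ids, &mut hashes);
    hash_column(&names, &mut hashes);

    // Rows 0 and 2 have identical key values, so their hashes are equal.
    assert_eq!(hashes[0], hashes[2]);
    println!("{:?}", hashes);
}
```

Equal hashes do not prove equal keys, so a join or aggregate built on this 
buffer would still need to compare the underlying column values on a match.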
   
    
   
   
   TBD:
   We should decide/find out whether it still makes sense to use the Rust 
`HashMap` (with `()` as key?) or whether to create our own. The benefit of using 
`HashMap` is that it has an existing API, resizes automatically, uses SIMD, and 
also exposes some lower-level bits we can use here.
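
As one possible shape for the "keep `HashMap`" option, here is a rough sketch 
using only the standard library. It is an assumption for illustration, not a 
decision recorded in this issue, and it is not the `()`-as-key variant mentioned 
above. The map is keyed by a precomputed row hash and uses a pass-through 
hasher so the value is not hashed a second time. `IdentityHasher`, 
`RowIndexMap`, and `build_row_index` are hypothetical names.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher: the key is already a hash, so return it unchanged.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, _bytes: &[u8]) {
        // Only u64 keys are expected in this sketch.
        unreachable!("IdentityHasher only supports u64 keys");
    }

    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
}

/// Maps a precomputed row hash to the indices of the rows that produced it.
type RowIndexMap = HashMap<u64, Vec<usize>, BuildHasherDefault<IdentityHasher>>;

/// Groups row indices by their precomputed hash.
fn build_row_index(hashes: &[u64]) -> RowIndexMap {
    let mut map = RowIndexMap::default();
    for (row, &hash) in hashes.iter().enumerate() {
        map.entry(hash).or_default().push(row);
    }
    map
}

fn main() {
    // Pretend these came from a vectorized hashing pass over the key columns.
    let hashes = [7u64, 42, 7, 99];
    let map = build_row_index(&hashes);

    // Rows 0 and 2 share a hash and land in the same bucket.
    assert_eq!(map.get(&7u64), Some(&vec![0, 2]));
    assert_eq!(map.len(), 3);
}
```

Rows that share a bucket are only candidates: distinct keys can collide, so the 
actual key values still have to be compared before rows are treated as equal.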
   
   
   
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
