[GitHub] [arrow-datafusion] Dandandan opened a new issue #816: Hash join with direct indexing

GitBox Mon, 02 Aug 2021 11:30:09 -0700


Dandandan opened a new issue #816:
URL: https://github.com/apache/arrow-datafusion/issues/816



   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   When the range of the join keys is small and we have integer keys, instead 
of using a hash table we can use an array/vec whose first element corresponds 
to the minimum key value.
   
   Idea from:
   https://github.com/duckdb/duckdb/pull/1959
   
   **Describe the solution you'd like**
   Filter on the following conditions:
   * The join constraint should be on a single integer key (boolean/byte also 
possible, as long as it is limited in domain).
   * No duplicates (?) - not sure whether it's required.
   * If no duplicates -> we can see how many unique values we have -> should 
be less than a max limit, e.g. less than 100K.
   * Get statistics on min/max values. It should be a small enough build side 
(e.g. max 100K key values).
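
The min/max check in the list above could be sketched roughly as follows. This is only an illustration: `MAX_DIRECT_INDEX_SLOTS` and `direct_index_range` are hypothetical names, not existing DataFusion APIs, and the sketch scans the keys itself instead of reading min/max from statistics.

```rust
/// Hypothetical cap on the number of slots in the direct-index table
/// (the issue suggests something around 100K).
const MAX_DIRECT_INDEX_SLOTS: i128 = 100_000;

/// Returns `Some((min, slots))` when the build-side keys span a range small
/// enough for direct indexing, and `None` otherwise (including empty input).
/// Assumption: min/max are computed by scanning the keys here; a real
/// implementation might take them from column statistics.
fn direct_index_range(keys: &[i64]) -> Option<(i64, usize)> {
    let min = *keys.iter().min()?;
    let max = *keys.iter().max()?;
    // Widen to i128 so `max - min + 1` cannot overflow for extreme i64 keys.
    let slots = (max as i128) - (min as i128) + 1;
    if slots <= MAX_DIRECT_INDEX_SLOTS {
        Some((min, slots as usize))
    } else {
        None
    }
}
```

For keys `[10, 12, 15]` the range is `10..=15`, so six slots would be needed even though only three keys are present.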
   
   If this is all true we can copy the offsets to something like a 
`Vec<Option<u64>>` (or arrow equivalent) which contains the build-side row 
offset for each key. Each element is indexed by `x - MIN(x)`.
   
   Instead of hashing the right-side values and probing, we can compute `y - 
MIN(x)` for the right side and index directly into the array. Checking hash 
collisions is not necessary for this approach.  
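
A minimal sketch of the build and probe steps described above, assuming `i64` keys held in plain slices and unique build-side keys. `DirectIndex`, `build`, `probe`, and `direct_index_join` are hypothetical names for illustration; none of this is DataFusion or Arrow API.

```rust
/// Direct-index table: slot `k - min` holds the build-side row offset of key `k`.
struct DirectIndex {
    min: i64,
    slots: Vec<Option<u64>>,
}

impl DirectIndex {
    /// Builds the table from the build-side keys. Returns `None` for empty
    /// input, for a key range wider than `max_slots`, or when a duplicate key
    /// is found (this sketch assumes unique keys).
    fn build(keys: &[i64], max_slots: usize) -> Option<Self> {
        let min = *keys.iter().min()?;
        let max = *keys.iter().max()?;
        // i128 arithmetic avoids overflow when the i64 range is very wide.
        let range = (max as i128) - (min as i128) + 1;
        if range > max_slots as i128 {
            return None;
        }
        let mut slots: Vec<Option<u64>> = vec![None; range as usize];
        for (row, &key) in keys.iter().enumerate() {
            let idx = ((key as i128) - (min as i128)) as usize;
            if slots[idx].is_some() {
                return None; // duplicate build-side key
            }
            slots[idx] = Some(row as u64);
        }
        Some(Self { min, slots })
    }

    /// Looks up a probe-side key by computing `y - MIN(x)`. No hashing and no
    /// collision check; only a bounds check for keys outside the build range.
    fn probe(&self, key: i64) -> Option<u64> {
        let idx = (key as i128) - (self.min as i128);
        if idx < 0 || idx >= self.slots.len() as i128 {
            return None;
        }
        self.slots[idx as usize]
    }
}

/// Inner join returning `(build_row, probe_row)` pairs, or `None` when the
/// build side does not qualify for direct indexing.
fn direct_index_join(build: &[i64], probe: &[i64], max_slots: usize) -> Option<Vec<(u64, u64)>> {
    let table = DirectIndex::build(build, max_slots)?;
    let pairs = probe
        .iter()
        .enumerate()
        .filter_map(|(row, &key)| table.probe(key).map(|b| (b, row as u64)))
        .collect();
    Some(pairs)
}
```

When `build` returns `None`, a caller would presumably fall back to the regular hash join. Supporting duplicate build-side keys would need a different slot layout, which this sketch does not attempt.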
   
   **Describe alternatives you've considered**
   
   **Additional context**
   


