alamb opened a new issue #822:
URL: https://github.com/apache/arrow-datafusion/issues/822


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   The `create_hash` function is responsible for hashing values in arrays. At 
the moment, however, it (effectively) hashes NULL values to `0` for all types, 
which likely leads to suboptimal behavior. For example, @Dandandan observed in 
https://github.com/apache/arrow-datafusion/pull/812#discussion_r682319823 that 
`NULL,1` and `1,NULL` hash to the same value.
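
   To make the collision concrete, here is a minimal sketch of a simplified model, not the actual `create_hash` code. It assumes a row hash is built by folding per-column hashes and that a NULL slot leaves the running hash unchanged, which is one way to end up with "effectively `0`". The names `hash_one`, `combine`, and `hash_row_nulls_skipped` are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash a single non-null value with the standard library hasher.
fn hash_one(v: i64) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

/// Hypothetical order-sensitive combiner (an assumption, chosen for brevity).
fn combine(acc: u64, h: u64) -> u64 {
    acc.wrapping_mul(37).wrapping_add(h)
}

/// Simplified model of the behavior described above: a NULL contributes
/// nothing, so its column position is never mixed into the row hash.
fn hash_row_nulls_skipped(row: &[Option<i64>]) -> u64 {
    row.iter().fold(0u64, |acc, v| match v {
        Some(x) => combine(acc, hash_one(*x)),
        None => acc, // NULL leaves the running hash untouched
    })
}
```

   Under this model, `hash_row_nulls_skipped(&[None, Some(1)])` and `hash_row_nulls_skipped(&[Some(1), None])` return the same value, because nothing records which column held the NULL.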
   
   **Describe the solution you'd like**
   TBD
   
   **Describe alternatives you've considered**
   @jorgecarleitao 's comment (copied below) from 
https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888516731 
offers a few alternatives:
   
   From the hashing side, an unknown to me atm is how to efficiently hash 
`values+validity`. I.e. given `V = ["a", "", "c"]` and `N = [true, false, 
true]`, I see some options:
   
   * `hash(V) ^ !N + unique * N` where `unique` is a unique sentinel value 
exclusive for null values. If `hash` is vectorized, this operation is 
vectorized.
   
   * `concat(hash(value), is_valid) for value, is_valid in zip(V,N)`
   
   * split the array between nulls and not nulls, i.e. `N -> (non-null indices, 
null indices)`, perform hashing over valid indices only, and then, at the very 
end, append all values for the nulls. We do this in the sort kernel, to reduce 
the number of slots to perform comparisons over.
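
   The three options above could be sketched roughly as follows. This is plain Rust over slices rather than Arrow arrays, and everything named here (`NULL_SENTINEL`, `hash_str`, `hash_with_sentinel`, `hash_with_validity`, `hash_split`) is a hypothetical stand-in. The first function is one reading of the `hash(V) ^ !N + unique * N` idea: a branch-free select between the value hash and the sentinel.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Arbitrary constant reserved for NULL slots (an assumption, not a real API).
const NULL_SENTINEL: u64 = 0x9E37_79B9_7F4A_7C15;

fn hash_str(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

/// Option 1: hash every slot, then select hash or sentinel without branching.
fn hash_with_sentinel(values: &[&str], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity)
        .map(|(v, &ok)| {
            let mask = (ok as u64).wrapping_neg(); // all ones if valid, 0 if null
            (hash_str(v) & mask) | (NULL_SENTINEL & !mask)
        })
        .collect()
}

/// Option 2: feed the validity bit into the hasher together with the value.
fn hash_with_validity(values: &[&str], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity)
        .map(|(v, &ok)| {
            let mut h = DefaultHasher::new();
            ok.hash(&mut h);
            if ok {
                v.hash(&mut h); // the placeholder under a NULL is never hashed
            }
            h.finish()
        })
        .collect()
}

/// Option 3: split indices into valid and null, hash only the valid ones,
/// then fill in the null slots at the end.
fn hash_split(values: &[&str], validity: &[bool]) -> Vec<u64> {
    let (valid_idx, null_idx): (Vec<usize>, Vec<usize>) =
        (0..values.len()).partition(|&i| validity[i]);
    let mut out = vec![0u64; values.len()];
    for i in valid_idx {
        out[i] = hash_str(values[i]);
    }
    for i in null_idx {
        out[i] = NULL_SENTINEL;
    }
    out
}
```

   With `V = ["a", "", "c"]` and `N = [true, false, true]`, options 1 and 3 produce identical output here; they differ only in how much hashing work is spent on null slots. In all three, the hash of a null slot does not depend on whatever placeholder value sits underneath it.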
   
   If we could write the code in a way that we could "easily" switch between 
implementations (during dev only, not a conf parameter), we could bench whether 
one wins over the other, or under which circumstances.
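
   One possible way to make the implementations easy to swap during development is a small strategy trait selected at compile time through a type alias. This is only a sketch under that assumption; `NullStrategy`, `NullAsZero`, `NullAsSentinel`, `ActiveStrategy`, and `hash_column` are hypothetical names, not existing DataFusion items.

```rust
/// Hypothetical strategy: how a (value hash, validity) pair becomes a slot hash.
trait NullStrategy {
    fn slot_hash(value_hash: u64, is_valid: bool) -> u64;
}

/// Arbitrary constant reserved for NULL slots (an assumption).
const NULL_SENTINEL: u64 = 0x9E37_79B9_7F4A_7C15;

/// Models the behavior described in this issue: NULL hashes to 0.
struct NullAsZero;
/// Alternative under test: NULL hashes to a reserved sentinel.
struct NullAsSentinel;

impl NullStrategy for NullAsZero {
    fn slot_hash(value_hash: u64, is_valid: bool) -> u64 {
        if is_valid { value_hash } else { 0 }
    }
}

impl NullStrategy for NullAsSentinel {
    fn slot_hash(value_hash: u64, is_valid: bool) -> u64 {
        if is_valid { value_hash } else { NULL_SENTINEL }
    }
}

/// Dev-time switch: change this alias and rebuild to benchmark another strategy.
type ActiveStrategy = NullAsSentinel;

/// Generic over the strategy, so the choice is resolved at compile time.
fn hash_column<S: NullStrategy>(value_hashes: &[u64], validity: &[bool]) -> Vec<u64> {
    value_hashes
        .iter()
        .zip(validity)
        .map(|(&h, &ok)| S::slot_hash(h, ok))
        .collect()
}
```

   Because the strategy is a type parameter, a benchmark could call `hash_column::<NullAsZero>` and `hash_column::<NullAsSentinel>` side by side, while regular code uses `hash_column::<ActiveStrategy>` and no runtime configuration is exposed.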
   
   

