jorgecarleitao commented on issue #790:
URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888516731


   Great proposal.
   
   From the hashing side, one thing that is still unclear to me at the moment is 
how to efficiently hash `values+validity`. E.g., given the values 
`V = ["a", "", "c"]` and the validity `N = [true, false, true]`, I see some options:
   
   * `hash(V) ^ !N + unique * N`, where `unique` is a sentinel value reserved 
exclusively for null slots. If `hash` is vectorized, this whole operation stays 
vectorized (see the sketch after this list).
   
   * `concat(hash(value), is_valid) for value, is_valid in zip(V,N)`
   
   * split the array between nulls and non-nulls, i.e. `N -> (non-null indices, 
null indices)`, perform hashing over the valid indices only and, at the very 
end, append all the null values. We do this in the sort kernel to reduce the 
number of slots to perform comparisons over.
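   
   A minimal Rust sketch of the first two options, just to make them concrete. 
The sentinel constant, the function names, and the use of `DefaultHasher` are 
all illustrative here, not what DataFusion actually uses:
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::hash::{Hash, Hasher};
   
   /// Illustrative sentinel hash reserved for null slots (option 1).
   const NULL_SENTINEL: u64 = 0x9E37_79B9_7F4A_7C15;
   
   fn hash_value<T: Hash>(value: &T) -> u64 {
       let mut hasher = DefaultHasher::new();
       value.hash(&mut hasher);
       hasher.finish()
   }
   
   /// Option 1: one hash per slot; null slots get a sentinel hash.
   /// The select between `hash(value)` and the sentinel is branch-free,
   /// so a vectorized `hash` keeps the whole loop vectorizable.
   fn hash_with_sentinel(values: &[&str], validity: &[bool]) -> Vec<u64> {
       values
           .iter()
           .zip(validity)
           .map(|(value, &is_valid)| {
               let hash = hash_value(value);
               // branchless select: keep `hash` when valid, the sentinel when null
               (hash & (is_valid as u64).wrapping_neg())
                   | (NULL_SENTINEL & ((!is_valid) as u64).wrapping_neg())
           })
           .collect()
   }
   
   /// Option 2: fold the validity bit into the hash itself,
   /// i.e. conceptually `concat(hash(value), is_valid)`.
   fn hash_with_validity_bit(values: &[&str], validity: &[bool]) -> Vec<u64> {
       values
           .iter()
           .zip(validity)
           .map(|(value, &is_valid)| {
               let mut hasher = DefaultHasher::new();
               value.hash(&mut hasher);
               is_valid.hash(&mut hasher);
               hasher.finish()
           })
           .collect()
   }
   
   fn main() {
       let v = ["a", "", "c"];
       let n = [true, false, true];
       println!("{:?}", hash_with_sentinel(&v, &n));
       println!("{:?}", hash_with_validity_bit(&v, &n));
   }
   ```
   
   Note that with option 1 the empty string in `V` never collides with the null 
slot, because the null slot's hash is the sentinel regardless of whatever value 
happens to sit behind it.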
   
   If we could structure the code so that we can "easily" switch between 
implementations (during development only, not as a configuration parameter), we 
could benchmark whether one consistently wins over the others, or under which 
circumstances each does (see the benchmark sketch below).
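   
   For the "switch during dev" part, a dev-only `criterion` benchmark over the 
two sketches above might already be enough. This assumes `criterion` as a 
dev-dependency and the illustrative functions from the previous sketch being in 
scope; none of this is existing DataFusion code:
   
   ```rust
   use criterion::{black_box, criterion_group, criterion_main, Criterion};
   
   // Assumes `hash_with_sentinel` and `hash_with_validity_bit` from the
   // sketch above are available in this crate.
   fn bench_null_hashing(c: &mut Criterion) {
       // synthetic batch: roughly one null in every seven slots
       let values: Vec<&str> = (0..1024)
           .map(|i| if i % 7 == 0 { "" } else { "payload" })
           .collect();
       let validity: Vec<bool> = (0..1024).map(|i| i % 7 != 0).collect();
   
       c.bench_function("sentinel", |b| {
           b.iter(|| hash_with_sentinel(black_box(values.as_slice()), black_box(validity.as_slice())))
       });
       c.bench_function("validity_bit", |b| {
           b.iter(|| hash_with_validity_bit(black_box(values.as_slice()), black_box(validity.as_slice())))
       });
   }
   
   criterion_group!(benches, bench_null_hashing);
   criterion_main!(benches);
   ```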
   
   Regardless, nulls in the group by are so important that IMO any of these 
approaches is a +1 at this point xD

