jorgecarleitao commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-888516731
Great proposal. From the hashing side, an unknown to me atm is how to efficiently hash `values + validity`. I.e. given `V = ["a", "", "c"]` and `N = [true, false, true]`, I see some options:

* `hash(V) ^ !N + unique * N`, where `unique` is a unique sentinel value exclusive to null values. If `hash` is vectorized, this operation is vectorized. (A sketch of this option is below.)
* `concat(hash(value), is_valid) for value, is_valid in zip(V, N)`
* Split the array between nulls and non-nulls, i.e. `N -> (non-null indices, null indices)`, perform hashing over the valid indices only, and then, at the very end, append all the null values. We do this in the sort kernel to reduce the number of slots to perform comparisons over.

If we could write the code in a way that we could "easily" switch between implementations (during dev only, not a conf parameter), we could bench whether one wins over the other, or under which circumstances.

Regardless, nulls in the group by are so important that IMO any of these is +1 at this point xD
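To make the first option a bit more concrete, here is a minimal Rust sketch of the sentinel-based idea. It assumes plain `Vec`s instead of arrow arrays and `std`'s `DefaultHasher` instead of a vectorized hash kernel; `NULL_SENTINEL`, `hash_one`, and `hash_with_validity` are hypothetical names for this illustration, and the branchless select is written with multiplications rather than the exact `^`/`+` formula above.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sentinel hash assigned to null slots; any fixed value works as long as it is
/// unlikely to collide with real value hashes (arbitrary choice for this sketch).
const NULL_SENTINEL: u64 = 0x9E37_79B9_7F4A_7C15;

fn hash_one<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Branchless combination of value hashes with validity: valid slots keep their
/// value hash, null slots get the sentinel. With a vectorized hash this loop
/// has a chance to auto-vectorize, since there is no data-dependent branch.
fn hash_with_validity(values: &[&str], validity: &[bool]) -> Vec<u64> {
    values
        .iter()
        .zip(validity.iter())
        .map(|(v, &is_valid)| {
            let m = is_valid as u64; // 1 if valid, 0 if null
            hash_one(v) * m + NULL_SENTINEL * (1 - m)
        })
        .collect()
}

fn main() {
    let values = ["a", "", "c"];
    let validity = [true, false, true];
    let hashes = hash_with_validity(&values, &validity);
    // "a" and "c" keep their value hashes; the null slot gets the sentinel,
    // so a null and an empty string hash differently.
    assert_eq!(hashes[1], NULL_SENTINEL);
    println!("{hashes:?}");
}
```

Swapping the body of `hash_with_validity` would be enough to benchmark the other two variants against this one.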

