alamb commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-892830006
> Consistent hashes between Arrays and Scalar sound very nice but I think will require an extensive test suite for all data types so it doesn't get broken accidentally.

I agree the testing would be key here, and I am willing to write such tests.

> An alternative could be to store the calculated hash also in the GroupKey, I tried to do this but I couldn't figure out how to do so using the `std::hash::Hash` API. I didn't find some way to return the hash value directly, only to update the intermediate value of a `Hasher`

`insert_hashed_nocheck` is a good one, though I think it still requires that consistency between `create_hash` and `ScalarValue::hash` for the case of collisions, right?

> For reference, my experiments with grouping by mapping each input row to a consecutive integer used hashbrown like this:

Yes, that is a cool idea. I wonder if we could use something like that as an initial partial-aggregate pass: we would first aggregate each batch partially as you describe, and then update the overall aggregates from the partials.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
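To illustrate the consistency requirement discussed above, here is a toy sketch (using std's `DefaultHasher`, not DataFusion's actual hashing code): a group entry created by hashing an element of a column can only be found again if hashing the scalar form of that element produces the same value.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical helper, not part of DataFusion: hash a single value with
// the std Hasher API.
fn hash_one<T: Hash>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

fn main() {
    let column: Vec<i64> = vec![1, 2, 3];
    let scalar: i64 = 2;
    // If the per-element hash of the array and the hash of the equivalent
    // scalar ever diverge, a group inserted from the array side could
    // never be looked up from the scalar side.
    assert_eq!(hash_one(&column[1]), hash_one(&scalar));
    println!("hashes agree");
}
```

The same argument applies after a collision: the equality check that resolves it also has to treat the array element and the scalar as equal.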

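The consecutive-integer mapping mentioned in the comment could be sketched roughly like this (hypothetical code; std's `HashMap` stands in for hashbrown, which exposes largely the same map API): each distinct group key seen in a batch is assigned the next integer id, so downstream aggregation can index into flat per-group state vectors.

```rust
use std::collections::HashMap;

// Hypothetical sketch, not the DataFusion implementation: map each input
// row's group key to a consecutive integer group id.
fn group_indices(keys: &[&str]) -> (Vec<String>, Vec<usize>) {
    let mut map: HashMap<&str, usize> = HashMap::new();
    let mut distinct: Vec<String> = Vec::new(); // distinct keys, in first-seen order
    let mut indices = Vec::with_capacity(keys.len()); // one group id per input row
    for &k in keys {
        let next = distinct.len();
        // Existing key: reuse its id. New key: record it and assign `next`.
        let id = *map.entry(k).or_insert_with(|| {
            distinct.push(k.to_string());
            next
        });
        indices.push(id);
    }
    (distinct, indices)
}

fn main() {
    let (distinct, idx) = group_indices(&["a", "b", "a", "c", "b"]);
    assert_eq!(distinct, ["a", "b", "c"]);
    assert_eq!(idx, [0, 1, 0, 2, 1]);
}
```

Running this per batch would give the per-batch partial aggregates described above; a second pass would then merge partials whose keys match across batches.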