alamb commented on issue #790: URL: https://github.com/apache/arrow-datafusion/issues/790#issuecomment-892830006
> Consistent hashes between Arrays and Scalar sound very nice but I think will require an extensive test suite for all data types so it doesn't get broken accidentally.

I agree the testing would be key here, and I am willing to write such tests.

> An alternative could be to store the calculated hash also in the GroupKey, I tried to do this but I couldn't figure out how to do so using the `std::hash::Hash` API. I didn't find some way to return the hash value directly, only to update the intermediate value of a `Hasher`

`insert_hashed_nocheck` is a good one, though I think it still requires that consistency between `create_hash` and `ScalarValue::hash` for the case of collisions, right?

> For reference, my experiments with grouping by mapping each input row to a consecutive integer used hashbrown like this:

Yes, that is a cool idea. I wonder if we could use something like that as an initial partial-aggregate pass: we would first aggregate each batch partially as you describe, and then update the overall aggregates from the partials.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
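To illustrate the consistency requirement discussed above, here is a toy sketch (using std's `DefaultHasher`, not DataFusion's actual hashing code): a group entry created by hashing an element of a column can only be found again if hashing the scalar form of that element produces the same value.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical helper, not part of DataFusion: hash a single value with
// the std Hasher API.
fn hash_one<T: Hash>(v: &T) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

fn main() {
    let column: Vec<i64> = vec![1, 2, 3];
    let scalar: i64 = 2;
    // If the per-element hash of the array and the hash of the equivalent
    // scalar ever diverge, a group inserted from the array side could
    // never be looked up from the scalar side.
    assert_eq!(hash_one(&column[1]), hash_one(&scalar));
    println!("hashes agree");
}
```

The same argument applies after a collision: the equality check that resolves it also has to treat the array element and the scalar as equal.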

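The consecutive-integer mapping mentioned in the comment could be sketched roughly like this (hypothetical code; std's `HashMap` stands in for hashbrown, which exposes largely the same map API): each distinct group key seen in a batch is assigned the next integer id, so downstream aggregation can index into flat per-group state vectors.

```rust
use std::collections::HashMap;

// Hypothetical sketch, not the DataFusion implementation: map each input
// row's group key to a consecutive integer group id.
fn group_indices(keys: &[&str]) -> (Vec<String>, Vec<usize>) {
    let mut map: HashMap<&str, usize> = HashMap::new();
    let mut distinct: Vec<String> = Vec::new(); // distinct keys, in first-seen order
    let mut indices = Vec::with_capacity(keys.len()); // one group id per input row
    for &k in keys {
        let next = distinct.len();
        // Existing key: reuse its id. New key: record it and assign `next`.
        let id = *map.entry(k).or_insert_with(|| {
            distinct.push(k.to_string());
            next
        });
        indices.push(id);
    }
    (distinct, indices)
}

fn main() {
    let (distinct, idx) = group_indices(&["a", "b", "a", "c", "b"]);
    assert_eq!(distinct, ["a", "b", "c"]);
    assert_eq!(idx, [0, 1, 0, 2, 1]);
}
```

Running this per batch would give the per-batch partial aggregates described above; a second pass would then merge partials whose keys match across batches.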