Dandandan commented on code in PR #22145:
URL: https://github.com/apache/datafusion/pull/22145#discussion_r3240189863
##########
datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs:
##########
@@ -79,147 +79,334 @@ hash_float!(f16, f32, f64);
///
/// This specialization is significantly faster than using the more general
/// purpose `Row`s format
+enum GroupValuesPrimitiveState<T: ArrowPrimitiveType> {
+ GroupIds {
+ /// Stores the `(group_index, hash)` based on the hash of its value
+ ///
+ /// We also store `hash` is for reducing cost of rehashing. Such cost
+ /// is obvious in high cardinality group by situation.
+ /// More details can see:
+ /// <https://github.com/apache/datafusion/issues/15961>
+ map: HashTable<(usize, u64)>,
Review Comment:
I found `HashTable<(usize, T::Native)>` also to be a bit faster than this
current approach (and also saving memory). We can build the `Vec<T::Native>`
when emitting data.
Could be part of some future PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]