Dandandan commented on code in PR #16136: URL: https://github.com/apache/datafusion/pull/16136#discussion_r2100719962
########## datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive/mod.rs: ########## @@ -74,21 +77,21 @@ macro_rules! hash_float { hash_float!(f16, f32, f64); -/// A [`GroupValues`] storing a single column of primitive values +/// A [`GroupValues`] storing a single column of normal primitive values (bits <= 64) /// /// This specialization is significantly faster than using the more general /// purpose `Row`s format pub struct GroupValuesPrimitive<T: ArrowPrimitiveType> { /// The data type of the output array data_type: DataType, - /// Stores the `(group_index, hash)` based on the hash of its value + /// Stores the `(group_index, group_value)` /// - /// We also store `hash` is for reducing cost of rehashing. Such cost - /// is obvious in high cardinality group by situation. + /// We directly store copy of `group_value` for not only efficient + /// rehashing, but also efficient probing. /// More details can see: - /// <https://github.com/apache/datafusion/issues/15961> + /// <https://github.com/apache/datafusion/issues/16136> /// - map: HashTable<(usize, u64)>, + map: HashTable<(usize, T::Native)>, Review Comment: Is storing `values: Vec<T::Native>` in this case necessary? It could be rebuild in `emit_internal` by traversing the map items I think (create a `Vec` and update by group index). This avoids doing that while inserting / updating. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org