alamb commented on code in PR #6800:
URL: https://github.com/apache/arrow-datafusion/pull/6800#discussion_r1260023856
##########
datafusion/physical-expr/src/aggregate/average.rs:
##########
@@ -383,6 +435,189 @@ impl RowAccumulator for AvgRowAccumulator {
}
}
+/// An accumulator to compute the average of `[PrimitiveArray<T>]`.
+/// Stores values as native types, and does overflow checking
+///
+/// F: Function that calcuates the average value from a sum of
+/// T::Native and a total count
+#[derive(Debug)]
+struct AvgGroupsAccumulator<T, F>
+where
+ T: ArrowNumericType + Send,
+ F: Fn(T::Native, u64) -> Result<T::Native> + Send,
+{
+ /// The type of the internal sum
+ sum_data_type: DataType,
+
+ /// The type of the returned sum
+ return_data_type: DataType,
+
+ /// Count per group (use u64 to make UInt64Array)
+ counts: Vec<u64>,
+
+ /// Sums per group, stored as the native type
+ sums: Vec<T::Native>,
Review Comment:
> Is it possible to combine the counts and sums into one property, like
avg_states: Vec<(T::Native, u64)>? Since one sum and the related count are
always used together, I think it's better to put them together for better cache
locality.
Thank you for the comment @yahoNanJing
The reason the sums and counts are stored separately is to minimize copying
when forming the final output -- since the final output is columnar (two
columns) keeping the data as two `Vec`s allows the final `ArrayRefs` to be
created directly from that data.
It would be an interesting experiment to see if keeping them together and
improving cache locality outweighed the extra copy.
BTW if people are looking to optimize the inner loops more, I think removing
the bounds checks with unsafe might also help (but I don't plan to pursue it
until I find need to optimize more)
So instead of
```rust
let sum = &mut self.sums[group_index];
*sum = sum.add_wrapping(new_value);
```
```rust
unsafe {
let sum = sums.get_unchecked_mut(group_index);
*sum = sum.add_wrapping(new_value);
}
```
##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -111,6 +111,8 @@ pub(crate) struct GroupedHashAggregateStream {
/// first element in the array corresponds to normal accumulators
/// second element in the array corresponds to row accumulators
indices: [Vec<Range<usize>>; 2],
+ // buffer to be reused to store hashes
+ hashes_buffer: Vec<u64>,
Review Comment:
Note that this change was made to the existing row_hash (not the new one). I
will port the change to the new one as part of
https://github.com/apache/arrow-datafusion/pull/6904
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]