[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6800: RFC: Demonstrate new `GroupHashAggregate` stream approach (runs more than 2x faster!)

via GitHub Tue, 11 Jul 2023 10:07:25 -0700


alamb commented on code in PR #6800:
URL: https://github.com/apache/arrow-datafusion/pull/6800#discussion_r1260023856



##########
datafusion/physical-expr/src/aggregate/average.rs:
##########
@@ -383,6 +435,189 @@ impl RowAccumulator for AvgRowAccumulator {
     }
 }
 
+/// An accumulator to compute the average of `[PrimitiveArray<T>]`.
+/// Stores values as native types, and does overflow checking
+///
+/// F: Function that calcuates the average value from a sum of
+/// T::Native and a total count
+#[derive(Debug)]
+struct AvgGroupsAccumulator<T, F>
+where
+    T: ArrowNumericType + Send,
+    F: Fn(T::Native, u64) -> Result<T::Native> + Send,
+{
+    /// The type of the internal sum
+    sum_data_type: DataType,
+
+    /// The type of the returned sum
+    return_data_type: DataType,
+
+    /// Count per group (use u64 to make UInt64Array)
+    counts: Vec<u64>,
+
+    /// Sums per group, stored as the native type
+    sums: Vec<T::Native>,

Review Comment:
   > Is it possible to combine the counts and sums into one property, like 
avg_states: Vec<(T::Native, u64)>? Since one sum and the related count are 
always used together, I think it's better to put them together for better cache 
locality.
   
   Thank you for the comment @yahoNanJing 
   
   The reason the sums and counts are stored separately is to minimize copying 
when forming the final output -- since the final output is columnar (two 
columns) keeping the data as two `Vec`s allows the final `ArrayRefs` to be 
created directly from that data. 
   
   It would be an interesting experiment to see if keeping them together and 
improving cache locality outweighed the extra copy.
   
   BTW if people are looking to optimize the inner loops more, I think removing 
the bounds checks with unsafe might also help (but I don't plan to pursue it 
until I find need to optimize more)
   
   So instead of 
   ```rust
                   let sum = &mut self.sums[group_index];
                   *sum = sum.add_wrapping(new_value);
   ```
   
   ```rust
                   unsafe {
                     let sum = sums.get_unchecked_mut(group_index);
                     *sum = sum.add_wrapping(new_value);
                   }
   ```



##########
datafusion/core/src/physical_plan/aggregates/row_hash.rs:
##########
@@ -111,6 +111,8 @@ pub(crate) struct GroupedHashAggregateStream {
     /// first element in the array corresponds to normal accumulators
     /// second element in the array corresponds to row accumulators
     indices: [Vec<Range<usize>>; 2],
+    // buffer to be reused to store hashes
+    hashes_buffer: Vec<u64>,

Review Comment:
   Note that this change was made to the existing row_hash (not the new one). I 
will port the change to the new one as part of 
https://github.com/apache/arrow-datafusion/pull/6904



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6800: RFC: Demonstrate new `GroupHashAggregate` stream approach (runs more than 2x faster!)

Reply via email to