yjshen commented on code in PR #6800:
URL: https://github.com/apache/arrow-datafusion/pull/6800#discussion_r1251142165


##########
datafusion/physical-expr/src/aggregate/average.rs:
##########
@@ -383,6 +419,303 @@ impl RowAccumulator for AvgRowAccumulator {
     }
 }
 
+/// An accumulator to compute the average of PrimitiveArray<T>.
+/// Stores values as native types, and does overflow checking
+///
+/// F: Function that calculates the average value from a sum of
+/// T::Native and a total count
+#[derive(Debug)]
+struct AvgGroupsAccumulator<T, F>
+where
+    T: ArrowNumericType + Send,
+    F: Fn(T::Native, u64) -> Result<T::Native> + Send,
+{
+    /// The type of the internal sum
+    sum_data_type: DataType,
+
+    /// The type of the returned sum
+    return_data_type: DataType,
+
+    /// Count per group (use u64 to make UInt64Array)
+    counts: Vec<u64>,
+
+    /// Sums per group, stored as the native type
+    sums: Vec<T::Native>,
+
+    /// If we have seen a null input value for this group_index
+    null_inputs: BooleanBufferBuilder,
+
+    /// Function that computes the average (value / count)
+    avg_fn: F,
+}
+
+impl<T, F> AvgGroupsAccumulator<T, F>
+where
+    T: ArrowNumericType + Send,
+    F: Fn(T::Native, u64) -> Result<T::Native> + Send,
+{
+    pub fn new(sum_data_type: &DataType, return_data_type: &DataType, avg_fn: 
F) -> Self {
+        debug!(
+            "AvgGroupsAccumulator ({}, sum type: {sum_data_type:?}) --> 
{return_data_type:?}",
+            std::any::type_name::<T>()
+        );
+
+        Self {
+            return_data_type: return_data_type.clone(),
+            sum_data_type: sum_data_type.clone(),
+            counts: vec![],
+            sums: vec![],
+            null_inputs: BooleanBufferBuilder::new(0),
+            avg_fn,
+        }
+    }
+
+    /// Adds one to each group's counter
+    fn increment_counts(
+        &mut self,
+        group_indices: &[usize],
+        values: &PrimitiveArray<T>,
+        opt_filter: Option<&arrow_array::BooleanArray>,
+        total_num_groups: usize,
+    ) {
+        self.counts.resize(total_num_groups, 0);

Review Comment:
   I think this is the point I made earlier in the issue:
   
   ```Adaptive sizing(perhaps?): How would the hash table header and states in 
each accumulator initialize and grow their sizes afterward?```
   
   While reading the design doc, I thought the states inside each accumulator 
would be segmented into **fixed-sized vectors, allocating one new vector at a 
time**, filling it until full, then creating a new vector for upcoming new states.
   
   ```
                                            ┌──────────────┐   ┌──────────────┐ 
  ┌──────────────┐
                                            │┌────────────┐│   │┌────────────┐│ 
  │┌────────────┐│
       ┌─────┐                              ││accumulator ││   ││accumulator ││ 
  ││accumulator ││
       │  5  │                              ││     AGG    ││   ││     SUM    ││ 
  ││     0      ││
       ├─────┤                              ││ ┌────────┐ ││   ││ ┌────────┐ ││ 
  ││ ┌────────┐ ││
       │  9  │                              ││ │ state- │ ││   ││ │ state- │ ││ 
  ││ │ state  │ ││
       ├─────┤                              ││ │segment-│ ││   ││ │segment-│ ││ 
  ││ │        │ ││
       │     │                              ││ │   1    │ ││   ││ │   1    │ ││ 
  ││ │        │ ││
       ├─────┤                              ││ │        │ ││   ││ │        │ ││ 
  ││ │        │ ││
       │  1  │                              ││ │        │ ││   ││ │        │ ││ 
  ││ │        │ ││
       ├─────┤                              ││ │        │ ││   ││ │        │ ││ 
  ││ │        │ ││
       │     │                              ││ │        │ ││   ││ │        │ ││ 
  ││ │        │ ││
       └─────┘                              ││ │        │ ││   ││ │        │ ││ 
  ││ │        │ ││
                                            ││ │        │ ││   ││ │        │ ││ 
  ││ │        │ ││
                                            ││ └────────┘ ││   ││ └────────┘ ││ 
  ││ └────────┘ ││
                                            ││            ││   ││            ││ 
  │└────────────┘│
       Hash Table                           ││ ┌────────┐ ││   ││ ┌────────┐ ││ 
  └──────────────┘
                                            ││ │ state- │ ││   ││ │ state- │ ││ 
                  
                                            ││ │segment-│ ││   ││ │segment-│ ││ 
                  
                                            ││ │   2    │ ││   ││ │   2    │ ││ 
                  
                                            ││ │        │ ││   ││ │        │ ││ 
                  
                                            ││ │        │ ││   ││ │        │ ││ 
                  
                                            ││ │        │ ││   ││ │        │ ││ 
                  
                                            ││ │        │ ││   ││ │        │ ││ 
                  
                                            ││ │        │ ││   ││ │        │ ││ 
                  
                                            ││ │        │ ││   ││ │        │ ││ 
                  
                                            ││ └────────┘ ││   ││ └────────┘ ││ 
                  
                                            │└────────────┘│   │└────────────┘│ 
                  
                                            └──────────────┘   └──────────────┘ 
                  
                                                                                
                  
   ```   
   
   Through this segmented approach, we could avoid a memory copy on each resize 
— the number of resizes would be large for high-cardinality aggregations — and 
grow the size more predictably. 
   
   But admittedly, this approach would also add complexity to both header 
pointer management and updates that span multiple vectors.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to