andygrove commented on a change in pull request #7687:
URL: https://github.com/apache/arrow/pull/7687#discussion_r461856982



##########
File path: rust/datafusion/src/execution/physical_plan/hash_aggregate.rs
##########
@@ -327,120 +278,47 @@ impl RecordBatchReader for GroupedHashAggregateIterator {
                 })
                 .collect::<ArrowResult<Vec<_>>>()?;
 
-            // create vector large enough to hold the grouping key
+            // create vector to hold the grouping key
             let mut key = Vec::with_capacity(group_values.len());
             for _ in 0..group_values.len() {
-                key.push(GroupByScalar::UInt32(0));
+                key.push(KeyScalar::UInt32(0));
             }
 
             // iterate over each row in the batch and create the accumulators 
for each grouping key
-            let mut accumulators: Vec<Rc<AccumulatorSet>> =

Review comment:
       It isn't clear, but this is actually an optimization. By buiding the vec 
of accumulators, it allows better columnar processing and presumably is making 
better use of the cpu cache. This PR shows the same minor slow down in 
performance as the other branch I created.
   
   Before:
   
   ```
   Running benchmarks with the following options: Opt { debug: false, 
iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv/year=2019", 
file_format: "csv" }
   Executing 'fare_amt_by_passenger'
   Query 'fare_amt_by_passenger' iteration 0 took 11058 ms
   Query 'fare_amt_by_passenger' iteration 1 took 11314 ms
   Query 'fare_amt_by_passenger' iteration 2 took 11479 ms
   ```
   
   After
   
   ```
   Running benchmarks with the following options: Opt { debug: false, 
iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv/year=2019", 
file_format: "csv" }
   Executing 'fare_amt_by_passenger'
   Query 'fare_amt_by_passenger' iteration 0 took 11636 ms
   Query 'fare_amt_by_passenger' iteration 1 took 12500 ms
   Query 'fare_amt_by_passenger' iteration 2 took 12358 ms
   ```
   
   This is from running the benchmark crate in this repo.
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to