andygrove commented on a change in pull request #7687:
URL: https://github.com/apache/arrow/pull/7687#discussion_r461856982
##########
File path: rust/datafusion/src/execution/physical_plan/hash_aggregate.rs
##########
@@ -327,120 +278,47 @@ impl RecordBatchReader for GroupedHashAggregateIterator {
})
.collect::<ArrowResult<Vec<_>>>()?;
- // create vector large enough to hold the grouping key
+ // create vector to hold the grouping key
let mut key = Vec::with_capacity(group_values.len());
for _ in 0..group_values.len() {
- key.push(GroupByScalar::UInt32(0));
+ key.push(KeyScalar::UInt32(0));
}
// iterate over each row in the batch and create the accumulators
for each grouping key
- let mut accumulators: Vec<Rc<AccumulatorSet>> =
Review comment:
It isn't clear, but this is actually an optimization. By buiding the vec
of accumulators, it allows better columnar processing and presumably is making
better use of the cpu cache. This PR shows the same minor slow down in
performance as the other branch I created.
Before:
```
Running benchmarks with the following options: Opt { debug: false,
iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv/year=2019",
file_format: "csv" }
Executing 'fare_amt_by_passenger'
Query 'fare_amt_by_passenger' iteration 0 took 11058 ms
Query 'fare_amt_by_passenger' iteration 1 took 11314 ms
Query 'fare_amt_by_passenger' iteration 2 took 11479 ms
```
After
```
Running benchmarks with the following options: Opt { debug: false,
iterations: 3, batch_size: 4096, path: "/mnt/nyctaxi/csv/year=2019",
file_format: "csv" }
Executing 'fare_amt_by_passenger'
Query 'fare_amt_by_passenger' iteration 0 took 11636 ms
Query 'fare_amt_by_passenger' iteration 1 took 12500 ms
Query 'fare_amt_by_passenger' iteration 2 took 12358 ms
```
This is from running the benchmark crate in this repo.
```
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]