Dandandan commented on issue #418:
URL:
https://github.com/apache/arrow-datafusion/issues/418#issuecomment-850491025
@ravlio
You are very right, that part is suboptimal.
Also in the hash aggregates there are a couple of other issues:
* Keys are created per row and indexed by `Vec<u8>`. Each `Vec<u8>` value
is hashed on insert, but is also rehashed as the hash table grows (Rust's
`HashMap` doesn't remember hash values by default, so it has to recompute
them every time the capacity of the table is exhausted).
We should do something similar (vectorized hashing) in the hash aggregates
as we do in the hash join. There the `create_hashes` function generates `u64`
hash values for each row in one simple loop, without downcasting / converting
per row. The challenging part is detecting hash collisions and making sure
that check is fast too (see the sketch after this list).
* Hash aggregates with small groups generate a lot of small allocations, as
well as dynamically typed scalar values and trait objects for the aggregate
states. In some parts, intermediate `Vec`s are allocated too, which also
shows up in profiles.
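
As a rough illustration of that direction (a minimal sketch, not DataFusion code): assuming a single `i64` key column, the functions below hash all rows in one loop, key the map on the precomputed `u64` so that growing the table stays cheap, and compare the original key values on every hit to detect collisions. `create_hashes` here is a toy stand-in for the real function, and `assign_groups` plus the mixing constant are made up for illustration.

```rust
use std::collections::HashMap;

// Toy hash of a single `i64` key column: one `u64` per row, computed in a
// single pass, no per-row `Vec<u8>` key and no downcasting inside the loop.
fn create_hashes(column: &[i64], hashes: &mut Vec<u64>) {
    hashes.clear();
    hashes.extend(
        column
            .iter()
            .map(|v| (*v as u64).wrapping_mul(0x9E37_79B9_7F4A_7C15)),
    );
}

// Assign a group id per row from the precomputed hashes. Keying the map on
// the `u64` hash keeps resizes cheap, and comparing the original key values
// on every hit is what detects hash collisions.
fn assign_groups(keys: &[i64]) -> Vec<usize> {
    let mut hashes = Vec::new();
    create_hashes(keys, &mut hashes);

    // hash -> list of (representative row, group id) sharing that hash
    let mut groups: HashMap<u64, Vec<(usize, usize)>> = HashMap::new();
    let mut next_group = 0;
    let mut group_ids = Vec::with_capacity(keys.len());

    for (row, hash) in hashes.iter().enumerate() {
        let candidates = groups.entry(*hash).or_default();
        // Collision check: equal hashes do not imply equal keys.
        let existing = candidates
            .iter()
            .find(|(r, _)| keys[*r] == keys[row])
            .map(|(_, g)| *g);
        let group = match existing {
            Some(g) => g,
            None => {
                let g = next_group;
                next_group += 1;
                candidates.push((row, g));
                g
            }
        };
        group_ids.push(group);
    }
    group_ids
}
```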
I've addressed some performance issues in different parts earlier; however,
there are still large gains possible, mainly for some of the more challenging
queries. I think what would be best in the long run is building a mutable,
typed array for the aggregation states, and keeping only the *offsets* into
that array in a hash table structure. This would allow vectorized updating of
the states, like this (pseudocode):
```rust
// sum: add each value to the state at its group's offset
for (offset, val) in batch_offsets.zip(batch_vals) {
    states[offset] += val;
}
// count: one increment per row at the group's offset
for offset in batch_offsets {
    states[offset] += 1;
}
// avg: keep a running sum and row count per group
for (offset, val) in batch_offsets.zip(batch_vals) {
    state_avg_sum[offset] += val;
    state_avg_rows[offset] += 1;
}
```
After the hash aggregation, the arrays don't need any further processing, as
all of the `offset` values are already increasing for each row (currently the
group-by values are converted to arrays again, which again takes time).
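
To make the offsets idea concrete, here is a minimal sketch of a summing accumulator along those lines (the names `SumAccumulator`, `update_batch`, and `finish` are illustrative, not DataFusion APIs): the states live in one contiguous typed `Vec`, the hash table only maps key -> offset, and because new groups are assigned increasing offsets, the key and state vectors can be handed back as output columns without another conversion pass.

```rust
use std::collections::HashMap;

/// Illustrative only: a SUM aggregation with typed, contiguous states and a
/// hash table that stores nothing but key -> offset.
struct SumAccumulator {
    offsets: HashMap<i64, usize>, // group key -> offset into the state vectors
    group_keys: Vec<i64>,         // keys in first-seen order (= offset order)
    states: Vec<i64>,             // one typed state per group, indexed by offset
}

impl SumAccumulator {
    fn new() -> Self {
        Self {
            offsets: HashMap::new(),
            group_keys: Vec::new(),
            states: Vec::new(),
        }
    }

    /// Resolve each row's key to an offset, then update the states in a
    /// tight loop over primitives, as in the pseudocode above.
    fn update_batch(&mut self, keys: &[i64], vals: &[i64]) {
        let offsets = &mut self.offsets;
        let group_keys = &mut self.group_keys;
        let states = &mut self.states;

        // Pass 1: key -> offset; unseen keys get the next offset.
        let batch_offsets: Vec<usize> = keys
            .iter()
            .map(|k| {
                *offsets.entry(*k).or_insert_with(|| {
                    group_keys.push(*k);
                    states.push(0);
                    states.len() - 1
                })
            })
            .collect();

        // Pass 2: vectorized state update.
        for (&offset, &val) in batch_offsets.iter().zip(vals.iter()) {
            states[offset] += val;
        }
    }

    /// Both vectors are already laid out by offset, so they can be emitted
    /// as output columns without converting the group-by values again.
    fn finish(self) -> (Vec<i64>, Vec<i64>) {
        (self.group_keys, self.states)
    }
}
```

The first pass still touches a hash map per row, but the second pass is a plain loop over primitive values, and `group_keys` / `states` grow in offset order, which is what removes the final conversion step.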