alamb edited a comment on pull request #844:
URL: https://github.com/apache/arrow-datafusion/pull/844#issuecomment-895383017
> If I recall, we will need to perform N x M comparisons, where N is the number of rows in the batch and M the distinct number of items in a group, around here, roughly represented in `for (row, hash) in batch_hashes.into_iter().enumerate()` and the inner `group_values.iter()....all(op)`.
Yes, I think that is accurate. I would love to figure out how to use a vectorized approach here.
In #808 we have a hash table mapping (effectively) `[ScalarValue]` ->
`(aggregate data)`, and we have input `[ArrayRef]`s. The hash calculation for
each input row is vectorized (so that is good), and then there is a loop that
looks like:
```rust
for (hash, index) in ... {
    // look up entry in hash table for row `index`
    // check if the values at `[ArrayRefs]` @ `index` are the same as in the
    // entry in the table (what this PR's code, `array_eq`, is used for)
    // ...
}
```
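For concreteness, here is a minimal, self-contained sketch of that per-row probe pattern (all names hypothetical, using plain `Vec<i64>` keys rather than the actual `ScalarValue`/`ArrayRef` types):

```rust
use std::collections::HashMap;

/// Hypothetical per-group state: the group's key values plus an
/// accumulator (here just a count).
struct Group {
    keys: Vec<i64>,
    count: u64,
}

/// Toy hash over one row's key values; the real code hashes whole
/// columns in a vectorized pass before this loop.
fn hash_row(keys: &[i64]) -> u64 {
    keys.iter()
        .fold(0u64, |h, &k| h.wrapping_mul(31).wrapping_add(k as u64))
}

fn group_by_count(rows: &[Vec<i64>]) -> Vec<(Vec<i64>, u64)> {
    let mut groups: Vec<Group> = Vec::new();
    // hash -> indices of groups sharing that hash (collision chain)
    let mut index: HashMap<u64, Vec<usize>> = HashMap::new();

    for row in rows {
        let hash = hash_row(row);
        let candidates = index.entry(hash).or_default();
        // The row-at-a-time comparison the comment describes: resolve
        // hash collisions by comparing stored keys against this row
        // (the role `array_eq` plays in this PR).
        let found = candidates
            .iter()
            .copied()
            .find(|&g| groups[g].keys == *row);
        match found {
            Some(g) => groups[g].count += 1,
            None => {
                candidates.push(groups.len());
                groups.push(Group { keys: row.clone(), count: 1 });
            }
        }
    }
    groups.into_iter().map(|g| (g.keys, g.count)).collect()
}
```

The hash computation before the loop vectorizes well; it is the per-row lookup and key comparison inside the loop that do not.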
In order to vectorize this calculation, I think we would have to somehow
vectorize both the lookup in the table as well as the comparison.
I suppose we could potentially copy the existing keys out of the hash
entries (so they are contiguous) to do a vectorized comparison, but my intuition
is that the cost of this copy would outweigh any gains from vectorization.
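To make the trade-off concrete, the copy-out idea might look something like this sketch (hypothetical names, `i64` keys flattened into a contiguous buffer; not the DataFusion implementation):

```rust
/// Gather the stored keys of candidate groups into one contiguous,
/// flattened buffer. This copy is the cost in question.
fn gather_keys(stored: &[Vec<i64>], candidates: &[usize], key_width: usize) -> Vec<i64> {
    let mut buf = Vec::with_capacity(candidates.len() * key_width);
    for &g in candidates {
        buf.extend_from_slice(&stored[g]);
    }
    buf
}

/// Compare one probe row against every candidate in the flat buffer.
/// The tight loop over contiguous memory is what gives the compiler a
/// chance to auto-vectorize the comparisons; returns the index of the
/// matching candidate, if any.
fn find_match(buf: &[i64], probe: &[i64], key_width: usize) -> Option<usize> {
    buf.chunks_exact(key_width).position(|cand| cand == probe)
}
```

Whether the `gather_keys` copy costs more than the vectorized `find_match` saves would need benchmarking; the answer likely depends on key width and the number of collision candidates.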