alamb commented on issue #1456: URL: https://github.com/apache/arrow-datafusion/issues/1456#issuecomment-996204065
I think `eq_array` is a symptom, rather than the root cause.
`eq_array` is necessary in the current hash aggregate implementation to
detect hash collisions:
```
                            ┌──────────────────────────────────────┐
┌─────────┐                 │Bucket                                │
│         │                 │(                                     │
│HashTable│        ┌───────▶│  grp_key1: ScalarValue               │
│         │────────┘        │  grp_key2: ScalarValue               │
│         │                 │)                                     │
│         │                 └──────────────────────────────────────┘
└─────────┘

┌─────────────┬─────────────┐
│Group Column │Group Column │
│      A      │      B      │    Step 1: hash(grp_key1, grp_key2) is
└─────────────┴─────────────┘    computed (vectorized)
      ...           ...
┌─────────────┬─────────────┐    Step 2: bucket for that hash value is
│  grp_key1   │  grp_key2   │    obtained
└─────────────┴─────────────┘
      ...           ...          Step 3: Validate that the values stored in
                                 the bucket are the same as the input key
                                 (aka that there are no hash collisions)

                                 eq_array is used for step 3
```
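To make the three steps concrete, here is a rough, simplified model of the flow in plain Rust. It is only a sketch and not the actual DataFusion code: `Scalar`, `Column`, `eq_scalar`, and `group_rows` are hypothetical stand-ins I made up for illustration (`Scalar` plays the role of `ScalarValue`, and `eq_scalar` plays the role of `eq_array`).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for `ScalarValue`, with only two variants.
#[derive(Debug, Clone, PartialEq)]
pub enum Scalar {
    Int64(i64),
    Utf8(String),
}

/// Hypothetical stand-in for a group column of the input batch.
pub enum Column {
    Int64(Vec<i64>),
    Utf8(Vec<String>),
}

impl Column {
    /// Feed the value at `row` into the hasher.
    fn hash_row<H: Hasher>(&self, row: usize, state: &mut H) {
        match self {
            Column::Int64(v) => v[row].hash(state),
            Column::Utf8(v) => v[row].hash(state),
        }
    }

    /// Analogue of `eq_array`: a type dispatch for every row and column.
    fn eq_scalar(&self, row: usize, value: &Scalar) -> bool {
        match (self, value) {
            (Column::Int64(v), Scalar::Int64(s)) => v[row] == *s,
            (Column::Utf8(v), Scalar::Utf8(s)) => v[row] == *s,
            _ => false,
        }
    }

    /// Copy the value at `row` out into an owned scalar.
    fn to_scalar(&self, row: usize) -> Scalar {
        match self {
            Column::Int64(v) => Scalar::Int64(v[row]),
            Column::Utf8(v) => Scalar::Utf8(v[row].clone()),
        }
    }
}

/// Assigns a group index to every row of the input columns.
pub fn group_rows(columns: &[Column], num_rows: usize) -> Vec<usize> {
    // Step 1: compute the hash of the group key for every row up front.
    let hashes: Vec<u64> = (0..num_rows)
        .map(|row| {
            let mut hasher = DefaultHasher::new();
            for column in columns {
                column.hash_row(row, &mut hasher);
            }
            hasher.finish()
        })
        .collect();

    let mut groups: Vec<Vec<Scalar>> = Vec::new(); // stored group keys
    let mut table: HashMap<u64, Vec<usize>> = HashMap::new(); // hash -> candidate groups
    let mut out = Vec::with_capacity(num_rows);

    for (row, hash) in hashes.iter().enumerate() {
        // Step 2: obtain the bucket for this hash value.
        let bucket = table.entry(*hash).or_default();
        // Step 3: check the stored key really equals the input key,
        // because two different keys may share one hash value.
        let found = bucket.iter().copied().find(|&g| {
            columns.iter().zip(&groups[g]).all(|(c, s)| c.eq_scalar(row, s))
        });
        let group = match found {
            Some(g) => g,
            None => {
                groups.push(columns.iter().map(|c| c.to_scalar(row)).collect());
                bucket.push(groups.len() - 1);
                groups.len() - 1
            }
        };
        out.push(group);
    }
    out
}
```

Note that step 3 goes through a `match` on the column type for every row and every group column, which is the kind of per-value dispatch I mean.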
I am pretty sure this code is correct, though since it is general purpose
(works for all types) there is non-trivial dispatch overhead.
If you are trying to speed up a distinct aggregate calculation, I suggest you
look into special casing group keys which are native types and which can be
packed into fixed length byte arrays (so they can be compared using plain
memory comparisons rather than dispatching on each column).
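As a minimal sketch of the packing idea, assume two non-null group columns of types `i64` and `u32` (the types and the names `pack_key` and `group_rows_packed` are hypothetical choices for this example, not anything that exists in DataFusion). Each row's key becomes a 12 byte array, so the hash table compares keys as raw bytes and never dispatches on column type:

```rust
use std::collections::HashMap;

/// Pack one row of the two group columns into a fixed length key:
/// 8 bytes for the i64 followed by 4 bytes for the u32.
pub fn pack_key(a: i64, b: u32) -> [u8; 12] {
    let mut key = [0u8; 12];
    key[..8].copy_from_slice(&a.to_le_bytes());
    key[8..].copy_from_slice(&b.to_le_bytes());
    key
}

/// Assigns a group index to every row, using the packed bytes as the key.
pub fn group_rows_packed(col_a: &[i64], col_b: &[u32]) -> Vec<usize> {
    assert_eq!(col_a.len(), col_b.len());
    let mut table: HashMap<[u8; 12], usize> = HashMap::new();
    let mut out = Vec::with_capacity(col_a.len());
    for (&a, &b) in col_a.iter().zip(col_b) {
        // A new key gets the next unused group index.
        let next = table.len();
        // Key equality here is a comparison of two 12 byte arrays.
        let group = *table.entry(pack_key(a, b)).or_insert(next);
        out.push(group);
    }
    out
}
```

This sketch ignores nulls entirely, and floating point keys would need extra care (for example `NaN` and `-0.0`), so treat it as an illustration of the direction rather than a drop-in design.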
Another way of saying this is: don't try to remove `eq_array`, but instead
try to remove the use of `ScalarValue` entirely.
