Andrew Lamb created ARROW-11635:
-----------------------------------
Summary: [Rust] [DataFusion] Improve performance for
grouping/hashing on dictionary encoded data
Key: ARROW-11635
URL: https://issues.apache.org/jira/browse/ARROW-11635
Project: Apache Arrow
Issue Type: Improvement
Reporter: Andrew Lamb
I am recording this for posterity / potential for someone else to help if they
want:
While adding support for GROUP BY hash, [~jorgecarleitao] had some great
suggestions
https://github.com/apache/arrow/pull/9233#issuecomment-762174671
The initial GROUP BY implementation hashes the actual value of the dictionary
(aka looks up the underlying value). For the common case such as when the
dictionary contains strings, this will likely do much more work than is
necessary.
In the common case we should be able to hash the dictionary indexes directly,
or possibly skip hashing entirely and build an aggregate table directly from
the indexes -- this would work incredibly well for low cardinality string
columns
What makes it tricky is that we would have to handle the case where the
dictionary itself is not the same across all record batches (and thus indexes
in one record batch may not correspond to the same value in another)
Some possibly implementation ideas are:
Implement a special case for a shared dictionary across all input record
batches, and have code to switch back to the more general case (hash table) if
the dictionary ever changes.
Alternately, we could hold a hash table (or equivalent) for each distinct
dictionary we saw and merge them all at the end.
The second approach likely would likely be the fastest, but also would
potentially consume the most resources
--
This message was sent by Atlassian Jira
(v8.3.4#803005)