[jira] [Created] (ARROW-11635) [Rust] [DataFusion] Improve performance for grouping/hashing on dictionary encoded data

Andrew Lamb (Jira) Mon, 15 Feb 2021 12:37:06 -0800

Andrew Lamb created ARROW-11635:
-----------------------------------

             Summary: [Rust] [DataFusion] Improve performance for 
grouping/hashing on dictionary encoded data
                 Key: ARROW-11635
                 URL: https://issues.apache.org/jira/browse/ARROW-11635
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: Andrew Lamb



I am recording this for posterity / potential for someone else to help if they 
want:

While adding support for GROUP BY hash, [~jorgecarleitao] had some great 
suggestions
https://github.com/apache/arrow/pull/9233#issuecomment-762174671

The initial GROUP BY implementation hashes the actual value of the dictionary 
(aka looks up the underlying value). For the common case such as when the 
dictionary contains strings, this will likely do much more work than is 
necessary. 

In the common case we should be able to hash the dictionary indexes directly, 
or  possibly skip hashing entirely and build an aggregate table directly from 
the indexes  -- this would work incredibly well for low cardinality string 
columns

What makes it tricky is that we would have to handle the case where the 
dictionary itself is not the same across all record batches (and thus indexes 
in one record batch may not correspond to the same value in another)

Some possibly implementation ideas are:
Implement a special case for a shared dictionary across all input record 
batches, and have code to switch back to the more general case (hash table) if 
the dictionary ever changes.

Alternately, we could hold a hash table (or equivalent) for each distinct 
dictionary we saw and merge them all at the end. 

The second approach likely would likely be the fastest, but also would 
potentially consume the most resources



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Created] (ARROW-11635) [Rust] [DataFusion] Improve performance for grouping/hashing on dictionary encoded data

Reply via email to