Wes McKinney created ARROW-32:
---------------------------------

             Summary: C++: add hash table classes for fixed-byte-width and variable-length primitive arrays
                 Key: ARROW-32
                 URL: https://issues.apache.org/jira/browse/ARROW-32
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: Wes McKinney
            Assignee: Wes McKinney


Some of the most important in-memory analytical routines are:

- unique
- contains / is-in
- match (see base::match in R or pandas.match)
- dictionary-encode (or "factorize", as I call it)
- frequency-table (unique + observed frequencies)

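At their lowest level, these all involve either iterative hash table
construction (a single pass over one array, e.g. unique) or
construct-then-sweep: build a table from one array, then probe it with
another (for the routines involving multiple arrays, e.g. contains/match).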
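
To make the two patterns concrete, here is a minimal sketch for int64 values.
It is only an illustration: it uses std::unordered_map / std::unordered_set as
stand-ins for the hash table classes this issue proposes, and the names
Factorized, Factorize and IsIn are hypothetical, not existing Arrow APIs.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Result of one iterative pass: covers unique, dictionary-encode and
// frequency-table at once. (Hypothetical struct, for illustration only.)
struct Factorized {
  std::vector<int64_t> uniques;  // distinct values, in first-seen order
  std::vector<int32_t> codes;    // codes[i] is the index of values[i] in uniques
  std::vector<int64_t> counts;   // counts[k] is the number of times uniques[k] occurs
};

// Iterative hash table construction: the table grows as the array is scanned.
Factorized Factorize(const std::vector<int64_t>& values) {
  Factorized out;
  std::unordered_map<int64_t, int32_t> table;  // value -> code
  out.codes.reserve(values.size());
  for (int64_t v : values) {
    // emplace only inserts when v is new; the candidate code is the next slot.
    auto result = table.emplace(v, static_cast<int32_t>(out.uniques.size()));
    if (result.second) {
      out.uniques.push_back(v);
      out.counts.push_back(0);
    }
    const int32_t code = result.first->second;
    out.codes.push_back(code);
    ++out.counts[code];
  }
  return out;
}

// Construct-then-sweep: build the table from one array, then probe it with
// every element of the other.
std::vector<bool> IsIn(const std::vector<int64_t>& needles,
                       const std::vector<int64_t>& haystack) {
  const std::unordered_set<int64_t> table(haystack.begin(), haystack.end());
  std::vector<bool> out;
  out.reserve(needles.size());
  for (int64_t v : needles) {
    out.push_back(table.count(v) > 0);
  }
  return out;
}
```

A match routine would follow the same shape as IsIn, except that the table
maps each value to its position and the sweep emits that position (or a
sentinel for a miss) instead of a boolean.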

Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will 
require some more thought, but performing these operations on fixed-byte-width 
types and lists thereof (e.g. strings as List<UInt8>) is fairly straightforward 
and can be used to craft more complex hash-table-based routines. 
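
As a rough sketch of the variable-length case, the code below
dictionary-encodes strings stored as an offsets buffer plus a contiguous byte
buffer, hashing each byte range in place. The layout struct (BinaryArray), the
Slice key, the choice of FNV-1a as the hash function, and the DictionaryEncode
name are all assumptions made for illustration; none of them is taken from the
Arrow codebase.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Assumed layout: value i occupies data[offsets[i]] up to (not including)
// data[offsets[i + 1]], so offsets has one more entry than there are values.
struct BinaryArray {
  std::vector<int32_t> offsets;
  std::vector<uint8_t> data;
};

// Non-owning view of one value's bytes; used as the hash table key.
struct Slice {
  const uint8_t* ptr;
  int32_t len;
};

// FNV-1a over the raw bytes (64-bit offset basis and prime).
struct SliceHash {
  std::size_t operator()(const Slice& s) const {
    uint64_t h = 14695981039346656037ULL;
    for (int32_t i = 0; i < s.len; ++i) {
      h ^= s.ptr[i];
      h *= 1099511628211ULL;
    }
    return static_cast<std::size_t>(h);
  }
};

// Two slices are equal when they have the same length and the same bytes.
struct SliceEqual {
  bool operator()(const Slice& a, const Slice& b) const {
    return a.len == b.len &&
           (a.len == 0 || std::memcmp(a.ptr, b.ptr, a.len) == 0);
  }
};

// Returns one code per value. first_index receives, for each distinct value,
// the position of its first occurrence in arr (the "dictionary").
std::vector<int32_t> DictionaryEncode(const BinaryArray& arr,
                                      std::vector<int32_t>* first_index) {
  std::unordered_map<Slice, int32_t, SliceHash, SliceEqual> table;
  std::vector<int32_t> codes;
  if (arr.offsets.empty()) return codes;
  const std::size_t n = arr.offsets.size() - 1;
  codes.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    const Slice s{arr.data.data() + arr.offsets[i],
                  arr.offsets[i + 1] - arr.offsets[i]};
    auto result = table.emplace(s, static_cast<int32_t>(first_index->size()));
    if (result.second) first_index->push_back(static_cast<int32_t>(i));
    codes.push_back(result.first->second);
  }
  return codes;
}
```

Because the keys point into arr.data, the table is only valid while the array
is alive. A fixed-byte-width type would be the special case where every slice
has the same length, so the offsets buffer is not needed.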



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
