Wes McKinney created ARROW-32:
---------------------------------
Summary: C++: add hash table classes for fixed-byte-width and
variable-length primitive arrays
Key: ARROW-32
URL: https://issues.apache.org/jira/browse/ARROW-32
Project: Apache Arrow
Issue Type: New Feature
Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney
Some of the most important in-memory analytical routines are:
- unique
- contains / is-in
- match (see base::match in R or pandas.match)
- dictionary-encode (aka "factorize" as I call it)
- frequency-table (unique + observed frequencies)
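At their lowest level, these all involve either iterative hash table
construction (the table grows as a single array is scanned) or
construct-then-sweep (build a table from one array, then probe it with
another; used by the routines involving multiple arrays, e.g. contains/match).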
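
To illustrate the two patterns, here is a minimal C++ sketch for int64
values. It uses std::unordered_map and std::unordered_set as stand-ins for a
purpose-built hash table; the names DictEncoded, DictionaryEncode and IsIn
are hypothetical and are not part of any existing Arrow API.

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical result type for illustration only (not an Arrow API).
struct DictEncoded {
  std::vector<int64_t> uniques;  // distinct values in first-seen order
  std::vector<int32_t> codes;    // index into `uniques` for each input value
  std::vector<int64_t> counts;   // observed frequency of each unique value
};

// Iterative construction: the table grows while the input is scanned once.
// A single pass yields unique, dictionary-encode and frequency-table.
DictEncoded DictionaryEncode(const std::vector<int64_t>& values) {
  DictEncoded out;
  std::unordered_map<int64_t, int32_t> table;  // value -> dictionary code
  out.codes.reserve(values.size());
  for (int64_t v : values) {
    // emplace inserts only if `v` is absent; `second` reports whether it did.
    auto result = table.emplace(v, static_cast<int32_t>(out.uniques.size()));
    if (result.second) {
      out.uniques.push_back(v);
      out.counts.push_back(0);
    }
    const int32_t code = result.first->second;
    out.codes.push_back(code);
    ++out.counts[code];
  }
  return out;
}

// Construct-then-sweep: build the table from `haystack`, then probe it once
// per element of `needles`.
std::vector<bool> IsIn(const std::vector<int64_t>& needles,
                       const std::vector<int64_t>& haystack) {
  std::unordered_set<int64_t> table(haystack.begin(), haystack.end());
  std::vector<bool> out;
  out.reserve(needles.size());
  for (int64_t v : needles) {
    out.push_back(table.count(v) > 0);
  }
  return out;
}
```

A match routine would follow the same construct-then-sweep shape as IsIn,
storing each value's position in the table and returning that position (or a
sentinel for misses) instead of a boolean.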
Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will
require some more thought, but performing these operations on fixed-byte-width
types and lists thereof (e.g. strings as List<UInt8>) is fairly straightforward
and can be used to craft more complex hash-table-based routines.
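
As a rough sketch of the variable-length case, the code below
dictionary-encodes values stored as an offsets array plus a contiguous byte
buffer, which is one way to picture List<UInt8>. The layout, the ByteSlice
type, the FNV-1a hash and the function name are all assumptions made for
illustration, not the proposed design. A fixed-byte-width value fits the same
scheme as a slice whose length never changes.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical view of one value: a byte range inside a shared data buffer.
struct ByteSlice {
  const uint8_t* data;
  size_t length;
};

// 64-bit FNV-1a over the slice's bytes (chosen here only for brevity).
struct ByteSliceHash {
  size_t operator()(const ByteSlice& s) const {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < s.length; ++i) {
      h ^= s.data[i];
      h *= 0x100000001b3ULL;
    }
    return static_cast<size_t>(h);
  }
};

// Two slices are equal when they have the same length and the same bytes.
struct ByteSliceEqual {
  bool operator()(const ByteSlice& a, const ByteSlice& b) const {
    return a.length == b.length &&
           (a.length == 0 || std::memcmp(a.data, b.data, a.length) == 0);
  }
};

// `offsets` has n + 1 entries; value i occupies data[offsets[i], offsets[i+1]).
// Returns one code per value and appends the index of each first occurrence
// to `unique_indices`.
std::vector<int32_t> DictionaryEncodeBytes(
    const std::vector<int32_t>& offsets, const std::vector<uint8_t>& data,
    std::vector<int32_t>* unique_indices) {
  std::unordered_map<ByteSlice, int32_t, ByteSliceHash, ByteSliceEqual> table;
  std::vector<int32_t> codes;
  if (offsets.empty()) return codes;
  const size_t n = offsets.size() - 1;
  codes.reserve(n);
  for (size_t i = 0; i < n; ++i) {
    // The key points into `data`; no bytes are copied.
    ByteSlice key{data.data() + offsets[i],
                  static_cast<size_t>(offsets[i + 1] - offsets[i])};
    auto result =
        table.emplace(key, static_cast<int32_t>(unique_indices->size()));
    if (result.second) unique_indices->push_back(static_cast<int32_t>(i));
    codes.push_back(result.first->second);
  }
  return codes;
}
```

Because the keys are views into the data buffer, the buffer has to outlive
the table.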