[
https://issues.apache.org/jira/browse/ARROW-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-32:
----------------------------
External issue URL: https://github.com/apache/arrow/issues/15400
> C++: add hash table classes for fixed-byte-width and variable-length
> primitive arrays
> -------------------------------------------------------------------------------------
>
> Key: ARROW-32
> URL: https://issues.apache.org/jira/browse/ARROW-32
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Antoine Pitrou
> Priority: Major
>
> Some of the most important in-memory analytical routines are:
> - unique
> - contains / is-in
> - match (see base::match in R or pandas.match)
> - dictionary-encode (aka "factorize" as I call it)
> - frequency-table (unique + observed frequencies)
> At their lowest level, these all involve either iterative hash table
> construction or construct-then-sweep (for the routines involving multiple
> arrays, e.g. contains/match).
> Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will
> require some more thought, but performing these operations on
> fixed-byte-width types and lists thereof (e.g. strings as List<UInt8>) is
> fairly straightforward and can be used to craft more complex hash-table based
> routines.
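
A minimal sketch of the two patterns named in the description, for illustration only. It is not Arrow's implementation: it assumes arrays can be modeled as std::vector<T> with no nulls, and it uses std::unordered_map in place of a purpose-built hash table. The names Encoded, DictionaryEncode, CountCodes and Match are hypothetical. Because the functions are templates, the same code covers a fixed-byte-width key (int64_t) and a variable-length key (std::string).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type (not an Arrow class): the distinct values in
// first-seen order, plus one code per input element pointing into uniques.
template <typename T>
struct Encoded {
  std::vector<T> uniques;
  std::vector<int32_t> codes;
};

// Iterative hash table construction: a single pass that inserts each unseen
// value and assigns it the next dense code. "unique" is out.uniques;
// "dictionary-encode" is the (uniques, codes) pair.
template <typename T>
Encoded<T> DictionaryEncode(const std::vector<T>& values) {
  Encoded<T> out;
  std::unordered_map<T, int32_t> table;
  out.codes.reserve(values.size());
  for (const T& v : values) {
    const int32_t next_code = static_cast<int32_t>(out.uniques.size());
    auto res = table.emplace(v, next_code);  // no-op if v is already present
    if (res.second) out.uniques.push_back(v);
    out.codes.push_back(res.first->second);
  }
  return out;
}

// Frequency table: unique values plus observed counts, derived from the codes.
template <typename T>
std::vector<int64_t> CountCodes(const Encoded<T>& enc) {
  std::vector<int64_t> counts(enc.uniques.size(), 0);
  for (int32_t c : enc.codes) {
    assert(c >= 0 && static_cast<std::size_t>(c) < counts.size());
    ++counts[static_cast<std::size_t>(c)];
  }
  return counts;
}

// Construct-then-sweep: build the table from `haystack`, then probe it with
// each element of `needles`. Returns the 0-based position of the first match,
// or -1 if there is none (an assumption of this sketch; R's match is 1-based
// and returns NA). "contains / is-in" is simply result != -1.
template <typename T>
std::vector<int32_t> Match(const std::vector<T>& needles,
                           const std::vector<T>& haystack) {
  std::unordered_map<T, int32_t> table;
  for (std::size_t i = 0; i < haystack.size(); ++i) {
    table.emplace(haystack[i], static_cast<int32_t>(i));  // keeps first position
  }
  std::vector<int32_t> out;
  out.reserve(needles.size());
  for (const T& v : needles) {
    auto it = table.find(v);
    out.push_back(it == table.end() ? -1 : it->second);
  }
  return out;
}
```

In this sketch every routine in the list reduces to one of the two loops, which is the point the description makes. A real implementation would also have to handle nulls and would likely hash the raw value bytes directly rather than go through std::hash.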