[ https://issues.apache.org/jira/browse/ARROW-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney closed ARROW-32.
-----------------------------
Resolution: Duplicate
Assignee: Antoine Pitrou (was: Wes McKinney)
This issue is a bit ill-defined, so I'm closing it. We now have much of what I
had in mind in March 2016; see also
https://github.com/apache/arrow/commit/eaf8d32e5f292dca0aa5b5508041d5d39406224d
> C++: add hash table classes for fixed-byte-width and variable-length
> primitive arrays
> -------------------------------------------------------------------------------------
>
> Key: ARROW-32
> URL: https://issues.apache.org/jira/browse/ARROW-32
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Antoine Pitrou
> Priority: Major
>
> Some of the most important in-memory analytical routines are:
> - unique
> - contains / is-in
> - match (see base::match in R or pandas.match)
> - dictionary-encode (aka "factorize" as I call it)
> - frequency-table (unique + observed frequencies)
> At their lowest level these all involve either iterative hash table
> construction or construct-then-sweep (for the routines involving multiple
> arrays, e.g. contains/match).
> Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will
> require some more thought, but performing these operations on
> fixed-byte-width types and lists thereof (e.g. strings as List<UInt8>) is
> fairly straightforward and can be used to craft more complex hash-table based
> routines.
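
For illustration only, here is a minimal sketch of the two patterns named in the
quoted description: iterative hash table construction (unique, dictionary-encode,
frequency-table) and construct-then-sweep (match, contains). It is not the code
in Arrow. std::unordered_map stands in for a purpose-built hash table,
std::vector stands in for Arrow arrays, and the names DictEncoded,
DictionaryEncode, CountsPerDictionaryEntry and Match are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical result type for dictionary-encoding ("factorize"): the
// distinct values in first-seen order, plus one dictionary index per input.
template <typename T>
struct DictEncoded {
  std::vector<T> dictionary;          // the "unique" values
  std::vector<std::int32_t> indices;  // position of each input in dictionary
};

// Iterative construction: a single pass that inserts unseen values as they
// appear and records the dictionary index of every element.
template <typename T>
DictEncoded<T> DictionaryEncode(const std::vector<T>& values) {
  DictEncoded<T> out;
  std::unordered_map<T, std::int32_t> table;
  out.indices.reserve(values.size());
  for (const T& v : values) {
    const auto next = static_cast<std::int32_t>(out.dictionary.size());
    auto result = table.emplace(v, next);  // no-op if v is already present
    if (result.second) out.dictionary.push_back(v);
    out.indices.push_back(result.first->second);
  }
  return out;
}

// Frequency table: observed count for each dictionary entry.
template <typename T>
std::vector<std::int64_t> CountsPerDictionaryEntry(const DictEncoded<T>& enc) {
  std::vector<std::int64_t> counts(enc.dictionary.size(), 0);
  for (std::int32_t i : enc.indices) ++counts[i];
  return counts;
}

// Construct-then-sweep: build a table from `haystack`, then probe it with
// each element of `needles`. Returns the first matching position in
// `haystack`, or -1 when there is no match.
template <typename T>
std::vector<std::int32_t> Match(const std::vector<T>& needles,
                                const std::vector<T>& haystack) {
  std::unordered_map<T, std::int32_t> table;
  for (std::size_t i = 0; i < haystack.size(); ++i) {
    // emplace keeps the existing entry, so the first position wins.
    table.emplace(haystack[i], static_cast<std::int32_t>(i));
  }
  std::vector<std::int32_t> out;
  out.reserve(needles.size());
  for (const T& v : needles) {
    auto it = table.find(v);
    out.push_back(it == table.end() ? -1 : it->second);
  }
  return out;
}
```

In this sketch "unique" is the dictionary field of the encoded result, and
"contains" is Match with the result compared against -1. std::string keys here
play the role of the variable-length case; a real implementation would hash the
raw bytes in the array buffers rather than copying values into a node-based map.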