[ 
https://issues.apache.org/jira/browse/ARROW-32?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rok Mihevc updated ARROW-32:
----------------------------
    External issue URL: https://github.com/apache/arrow/issues/15400

> C++: add hash table classes for fixed-byte-width and variable-length 
> primitive arrays
> -------------------------------------------------------------------------------------
>
>                 Key: ARROW-32
>                 URL: https://issues.apache.org/jira/browse/ARROW-32
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Antoine Pitrou
>            Priority: Major
>
> Some of the most important in-memory analytical routines are:
> - unique
> - contains / is-in
> - match (see base::match in R or pandas.match)
> - dictionary-encode (aka "factorize" as I call it)
> - frequency-table (unique + observed frequencies)
> At their lowest level these all involve either iterative hash table 
> construction or construct-then-sweep (for the routines involving multiple 
> arrays, e.g. contains/match). 
> Hashing more complex Arrow structures (e.g. structs or lists-of-structs) will 
> require some more thought, but performing these operations on 
> fixed-byte-width types and lists thereof (e.g. strings as List<UInt8>) is 
> fairly straightforward and can be used to craft more complex hash-table-based 
> routines. 
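
To make the two patterns in the description concrete, here is a minimal sketch of iterative hash table construction (dictionary-encode, frequency-table) and construct-then-sweep (is-in, match). It is not Arrow's implementation: the names DictEncoded, DictionaryEncode, ValueCounts, IsIn and Match are hypothetical, std::vector stands in for an Arrow array, and std::unordered_map / std::unordered_set stand in for a purpose-built hash table. Nulls are not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical result type: distinct values in first-seen order, plus one
// index into `uniques` per input element.
template <typename T>
struct DictEncoded {
  std::vector<T> uniques;
  std::vector<int32_t> indices;
};

// Iterative construction: the table grows while the input is scanned once.
// `uniques` alone is the result of "unique".
template <typename T>
DictEncoded<T> DictionaryEncode(const std::vector<T>& values) {
  DictEncoded<T> out;
  std::unordered_map<T, int32_t> table;
  out.indices.reserve(values.size());
  for (const T& v : values) {
    // emplace leaves an existing entry alone and reports whether it inserted.
    auto res = table.emplace(v, static_cast<int32_t>(out.uniques.size()));
    if (res.second) out.uniques.push_back(v);
    out.indices.push_back(res.first->second);
  }
  return out;
}

// Frequency table: unique values paired with their observed counts.
template <typename T>
std::vector<std::pair<T, int64_t>> ValueCounts(const std::vector<T>& values) {
  DictEncoded<T> enc = DictionaryEncode(values);
  std::vector<std::pair<T, int64_t>> out;
  for (const T& u : enc.uniques) out.emplace_back(u, 0);
  for (int32_t i : enc.indices) ++out[static_cast<std::size_t>(i)].second;
  return out;
}

// Construct-then-sweep: build the table from `value_set`, then probe it
// once per element of `values`.
template <typename T>
std::vector<bool> IsIn(const std::vector<T>& values,
                       const std::vector<T>& value_set) {
  std::unordered_set<T> table(value_set.begin(), value_set.end());
  std::vector<bool> out;
  out.reserve(values.size());
  for (const T& v : values) out.push_back(table.count(v) > 0);
  return out;
}

// Like base::match in R, but 0-based: position of the first occurrence of
// each value in `value_set`, or -1 when it is absent.
template <typename T>
std::vector<int32_t> Match(const std::vector<T>& values,
                           const std::vector<T>& value_set) {
  std::unordered_map<T, int32_t> table;
  for (std::size_t i = 0; i < value_set.size(); ++i) {
    table.emplace(value_set[i], static_cast<int32_t>(i));  // keeps first hit
  }
  std::vector<int32_t> out;
  out.reserve(values.size());
  for (const T& v : values) {
    auto it = table.find(v);
    out.push_back(it == table.end() ? -1 : it->second);
  }
  return out;
}
```

The same templates cover a fixed-byte-width type such as int32_t and a variable-length type such as std::string, because the standard library supplies a hash for both. A dedicated implementation would presumably hash the raw bytes of each slot directly instead of going through std::hash.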


