[ https://issues.apache.org/jira/browse/ARROW-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824279#comment-16824279 ]
Wes McKinney commented on ARROW-3978:
-------------------------------------
There are different approaches. You might want to look at what an existing
columnar database engine like ClickHouse or Dremio is doing for hashing tuples
(aka structs). One approach is to "pivot" or "recordize" the data from columnar
to record format (similar to NumPy's struct dtype memory layout, though it would
of course need to be generalized to account for varbinary, nulls, and nested
data -- nested data would have to be recursively flattened).
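To make the pivot step concrete, here is a minimal, standalone C++17 sketch (it does not use Arrow's APIs; the one-byte null flags and the FNV-1a hash are illustrative assumptions, not a proposed layout). Two fixed-width child columns of a struct are scattered into one contiguous key per row, and each key is then hashed as a byte string:
{code}
// Standalone C++17 sketch; not Arrow's API. Two fixed-width child columns of a
// struct<int64, int64> are "pivoted" into one contiguous key per row, then each
// key is hashed as a byte string. The 1-byte null flags and the FNV-1a hash are
// illustrative choices only.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <optional>
#include <vector>

// FNV-1a over an arbitrary byte range; any reasonable hash function would do.
uint64_t HashBytes(const uint8_t* data, size_t length) {
  uint64_t h = 14695981039346656037ULL;
  for (size_t i = 0; i < length; ++i) {
    h ^= data[i];
    h *= 1099511628211ULL;
  }
  return h;
}

int main() {
  // Child columns of the struct; std::nullopt marks a null slot.
  std::vector<std::optional<int64_t>> col_a = {1, 2, std::nullopt, 1};
  std::vector<std::optional<int64_t>> col_b = {10, 20, 30, 10};

  const size_t num_rows = col_a.size();
  // Per-row key: one byte of null flags followed by the two 8-byte values.
  const size_t key_width = 1 + 2 * sizeof(int64_t);
  std::vector<uint8_t> keys(num_rows * key_width, 0);

  // Pivot: scatter each column's values into the row-oriented key buffer.
  for (size_t row = 0; row < num_rows; ++row) {
    uint8_t* key = keys.data() + row * key_width;
    key[0] = static_cast<uint8_t>((col_a[row].has_value() ? 1 : 0) |
                                  (col_b[row].has_value() ? 2 : 0));
    const int64_t a = col_a[row].value_or(0);
    const int64_t b = col_b[row].value_or(0);
    std::memcpy(key + 1, &a, sizeof(a));
    std::memcpy(key + 1 + sizeof(a), &b, sizeof(b));
  }

  // Hash each recordized row; equal (a, b) rows produce equal hashes.
  for (size_t row = 0; row < num_rows; ++row) {
    std::cout << "row " << row << " -> "
              << HashBytes(keys.data() + row * key_width, key_width) << "\n";
  }
  return 0;
}
{code}
Varbinary children would need a length prefix (or to be hashed and combined separately), and nested children would be recursively flattened before this step, as noted above.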
NB: the same code paths involved in hashing structs will also be needed for
hash joins, hash aggregations, and other algorithms.
[~jnadeau] do you have any advice for us or pointers to literature about this
topic?
> [C++] Implement hashing, dictionary-encoding for StructArray
> ------------------------------------------------------------
>
> Key: ARROW-3978
> URL: https://issues.apache.org/jira/browse/ARROW-3978
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.14.0
>
>
> This is a central requirement for hash-aggregations such as
> {code}
> SELECT AGG_FUNCTION(expr)
> FROM table
> GROUP BY expr1, expr2, ...
> {code}
> The materialized keys in the GROUP BY section form a struct, which can be
> incrementally hashed to produce dictionary codes suitable for computing
> aggregates or any other purpose.
> There are a few subtasks related to this, such as efficiently constructing a
> record (that can be hashed quickly) to identify each "row" in the struct.
> Maybe we should start with that first.
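To make the dictionary-code step concrete, here is a minimal, standalone C++17 sketch (not Arrow's API). It assumes each row's GROUP BY key has already been recordized into an opaque byte string (as in the comment above, with placeholder key contents), assigns dense dictionary codes incrementally, and uses those codes to compute a per-group SUM:
{code}
// Standalone C++17 sketch; not Arrow's API. Assumes each row's GROUP BY key has
// already been recordized into an opaque byte string, then dictionary-encodes
// the keys into dense integer codes and computes SUM(expr) per code.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

int main() {
  // Placeholder recordized keys for five rows and the values fed to SUM(expr).
  std::vector<std::string> keys = {"k1", "k2", "k1", "k3", "k2"};
  std::vector<int64_t> values = {10, 20, 5, 7, 3};

  // Incremental dictionary encoding: the first occurrence of a key receives
  // the next dense code; later occurrences reuse it.
  std::unordered_map<std::string, int32_t> dictionary;
  std::vector<int32_t> codes(keys.size());
  for (size_t i = 0; i < keys.size(); ++i) {
    auto result =
        dictionary.emplace(keys[i], static_cast<int32_t>(dictionary.size()));
    codes[i] = result.first->second;
  }

  // Hash aggregation: the dense codes index directly into per-group state.
  std::vector<int64_t> sums(dictionary.size(), 0);
  for (size_t i = 0; i < keys.size(); ++i) {
    sums[codes[i]] += values[i];
  }
  for (const auto& entry : dictionary) {
    std::cout << entry.first << " -> SUM = " << sums[entry.second] << "\n";
  }
  return 0;
}
{code}
The same stream of codes could equally feed dictionary-encoded output or the build side of a hash join, which is why the hashing code paths should be shared.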