[ https://issues.apache.org/jira/browse/ARROW-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16824279#comment-16824279 ]

Wes McKinney commented on ARROW-3978:
-------------------------------------

There are different approaches. You might want to look at what an existing 
columnar database engine like ClickHouse or Dremio does for hashing tuples 
(aka structs). One approach is to "pivot" or "recordize" the data from columnar 
to record format (similar to NumPy's struct dtype memory layout, though it 
would need to be generalized to account for varbinary values, nulls, and 
nested data; nested data would have to be recursively flattened).

NB the same code paths involved with hashing structs will need to be used for 
hash joins, hash aggregations, and other algorithms. 
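To make the reuse concrete (again just a sketch, building on the hypothetical 
Recordize() helper above), the shared step is mapping each row key to a dense 
integer id: for a hash aggregation the id selects the accumulator, for a hash 
join the same table is the build side, and for dictionary-encoding the id is 
the dictionary code.

{code}
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Map each row key to a dense integer id, assigning ids in order of first
// appearance.  The table persists across batches so ids stay stable.
std::vector<int32_t> MapToGroupIds(
    const std::vector<std::string>& row_keys,
    std::unordered_map<std::string, int32_t>* table) {
  std::vector<int32_t> ids;
  ids.reserve(row_keys.size());
  for (const auto& key : row_keys) {
    auto it = table->find(key);
    if (it == table->end()) {
      auto next_id = static_cast<int32_t>(table->size());
      it = table->emplace(key, next_id).first;  // first occurrence of this key
    }
    ids.push_back(it->second);
  }
  return ids;
}
{code}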

[~jnadeau] do you have any advice for us or pointers to literature about this 
topic?

> [C++] Implement hashing, dictionary-encoding for StructArray
> ------------------------------------------------------------
>
>                 Key: ARROW-3978
>                 URL: https://issues.apache.org/jira/browse/ARROW-3978
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Priority: Major
>             Fix For: 0.14.0
>
>
> This is a central requirement for hash-aggregations such as
> {code}
> SELECT AGG_FUNCTION(expr)
> FROM table
> GROUP BY expr1, expr2, ...
> {code}
> The materialized keys in the GROUP BY clause form a struct, which can be 
> incrementally hashed to produce dictionary codes suitable for computing 
> aggregates or any other purpose.
>
> There are a few subtasks related to this, such as efficiently constructing a 
> record (that can be hashed quickly) to identify each "row" in the struct. 
> Maybe we should start with that first.
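Continuing the illustrative sketch above (hypothetical helpers, not the 
eventual Arrow API), incrementally dictionary-encoding GROUP BY keys across 
record batches could look like:

{code}
#include <cstdio>

int main() {
  // The key table persists across batches, so previously seen key
  // combinations keep their codes and new ones get the next unused code.
  std::unordered_map<std::string, int32_t> key_table;

  // Batch 1: single key column "a" = [1, 2, 1]
  Int64Column a1{{1, 2, 1}, {true, true, true}};
  auto codes1 = MapToGroupIds(Recordize({a1}, {}, 3), &key_table);

  // Batch 2: "a" = [2, 3, NULL]; 2 reuses code 1, 3 and NULL get new codes
  Int64Column a2{{2, 3, 0}, {true, true, false}};
  auto codes2 = MapToGroupIds(Recordize({a2}, {}, 3), &key_table);

  for (int32_t c : codes1) std::printf("%d ", c);
  std::printf("| ");
  for (int32_t c : codes2) std::printf("%d ", c);
  std::printf("\n");  // prints "0 1 0 | 1 2 3"
  return 0;
}
{code}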



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
