[
https://issues.apache.org/jira/browse/ARROW-3978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16825451#comment-16825451
]
Jacques Nadeau commented on ARROW-3978:
---------------------------------------
Here is some info about what we found worked well. Note that it doesn't go into
a lot of detail about the pivot algorithm beyond the basic concepts of fixed
and variable vectors.
[https://docs.google.com/document/d/1Yk6IvDL28IzEjqcqSkFdevRyMrC8_kwzEatHvcOnawM/edit]
Main idea around pivot:
* Separate fixed-width and variable-width data and keep each contiguous.
* Coalesce the nullability bits and the values together at the start of the
data structure (saves space and increases the likelihood of detecting a
mismatch early).
* Include the length of the variable-width data in the fixed container to
reduce the likelihood of having to jump to the variable container.
* Have specialized cases that check, for each word, whether nulls are actually
present and fork behavior accordingly, to improve performance in the common
case where values are mostly null or mostly non-null.
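The layout above can be sketched roughly as follows. This is a minimal illustration under assumed field widths (one validity word, one int column, and one varchar column per row), not Dremio's actual format:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Sketch of the pivoted layout: each fixed-width "row" holds a validity word
// (null bits coalesced up front), the fixed-width values, and the length of
// the row's variable-width data, so comparisons can often stop before ever
// touching the variable block. Field names and widths are illustrative.
public final class PivotSketch {
    // Fixed block, per row: [ int validityBits | int intValue | int varLen | int varOffset ]
    static final int ROW_BYTES = 16;

    public static ByteBuffer pivotFixed(Integer[] intCol, String[] strCol, ByteBuffer varBlock) {
        ByteBuffer fixed = ByteBuffer.allocate(intCol.length * ROW_BYTES).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = 0; i < intCol.length; i++) {
            int validity = 0;
            if (intCol[i] != null) validity |= 1;   // bit 0: int column non-null
            if (strCol[i] != null) validity |= 2;   // bit 1: string column non-null
            fixed.putInt(validity);
            fixed.putInt(intCol[i] == null ? 0 : intCol[i]);
            byte[] bytes = strCol[i] == null ? new byte[0] : strCol[i].getBytes(StandardCharsets.UTF_8);
            fixed.putInt(bytes.length);              // varlen kept in the fixed row
            fixed.putInt(varBlock.position());       // offset into the contiguous variable block
            varBlock.put(bytes);
        }
        fixed.flip();
        return fixed;
    }

    public static void main(String[] args) {
        ByteBuffer var = ByteBuffer.allocate(64);
        ByteBuffer fixed = pivotFixed(new Integer[]{7, null}, new String[]{"ab", null}, var);
        System.out.println(fixed.remaining());  // 32: two 16-byte fixed rows
        System.out.println(var.position());     // 2: bytes of variable data ("ab")
    }
}
```

Because the fixed rows are constant-width and carry the variable length, two rows whose fixed blocks differ (validity, values, or varlen) are unequal without any indirection into the variable block.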
The latest version of the Arrow pivot code we actually use can be found here:
Pivots:
[https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Pivots.java]
Unpivots:
[https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/Unpivots.java]
Hash Table:
[https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/common/ht2/LBlockHashTable.java]
We'd be happy to donate this code/algo to the community as it would probably
serve as a good foundation.
Note that the doc is probably somewhat out of date relative to the actual
implementation, as it was written early in development.
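The word-level null specialization in the list above can be illustrated roughly as below; the method and data shapes are assumptions made for the sketch, not taken from the Dremio code:

```java
// Rough illustration of forking behavior per 64-bit validity word: take a
// tight branch-free loop when every value in the word is non-null, skip the
// word entirely when all null, and fall back to bit-by-bit testing only for
// mixed words. Names and shapes are illustrative.
public final class NullWordSketch {
    static long sumWord(long validityWord, long[] values, int base) {
        if (validityWord == -1L) {          // all 64 values non-null: no bit tests needed
            long sum = 0;
            for (int i = 0; i < 64; i++) sum += values[base + i];
            return sum;
        }
        if (validityWord == 0L) {           // all 64 values null: nothing to do
            return 0;
        }
        long sum = 0;                       // mixed word: test each bit
        for (int i = 0; i < 64; i++) {
            if ((validityWord >>> i & 1L) != 0) sum += values[base + i];
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] values = new long[64];
        for (int i = 0; i < 64; i++) values[i] = i + 1;   // 1..64
        System.out.println(sumWord(-1L, values, 0));       // 2080 (fast path)
        System.out.println(sumWord(0L, values, 0));        // 0 (fast path)
        System.out.println(sumWord(0b101L, values, 0));    // 4 (slots 0 and 2: 1 + 3)
    }
}
```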
> [C++] Implement hashing, dictionary-encoding for StructArray
> ------------------------------------------------------------
>
> Key: ARROW-3978
> URL: https://issues.apache.org/jira/browse/ARROW-3978
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Priority: Major
> Fix For: 0.14.0
>
>
> This is a central requirement for hash-aggregations such as
> {code}
> SELECT AGG_FUNCTION(expr)
> FROM table
> GROUP BY expr1, expr2, ...
> {code}
> The materialized keys in the GROUP BY section form a struct, which can be
> incrementally hashed to produce dictionary codes suitable for computing
> aggregates or any other purpose.
> There are a few subtasks related to this, such as efficiently constructing a
> record (that can be hashed quickly) to identify each "row" in the struct.
> Maybe we should start with that first.
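As a rough sketch of the dictionary-encoding idea described in the issue above — mapping each distinct GROUP BY key row to a small integer code — using a plain hash map (names here are illustrative, not an Arrow API):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class GroupBySketch {
    // Assign each distinct key row the next unused dictionary code; aggregates
    // can then be accumulated in arrays indexed by code.
    public static int[] dictionaryEncode(List<List<Object>> keyRows) {
        Map<List<Object>, Integer> codes = new HashMap<>();
        int[] out = new int[keyRows.size()];
        for (int i = 0; i < keyRows.size(); i++) {
            out[i] = codes.computeIfAbsent(keyRows.get(i), k -> codes.size());
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = List.of(
                List.of("a", 1), List.of("b", 2), List.of("a", 1));
        System.out.println(Arrays.toString(dictionaryEncode(rows)));  // [0, 1, 0]
    }
}
```

A real implementation would hash the pivoted fixed/variable blocks directly rather than boxing each key into a `List<Object>`, but the code assignment works the same way.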
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)