[
https://issues.apache.org/jira/browse/ARROW-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16719134#comment-16719134
]
Francois Saint-Jacques commented on ARROW-47:
---------------------------------------------
At first sight, I'd say that the StructScalar (and Scalar) memory layout will be
critical to the implementation of ARROW-3978 (and to joins on multiple
columns/expressions). Hashing/probing on the columnar (SoA) representation is a
performance killer, due to k pointer indirections and cache-line reads, where k
is the number of fields.
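To make the layout concern concrete, here is a minimal sketch, assuming int64
key fields only. `HashCombine`, `HashRowSoA`, `HashRowAoS` and `PackRows` are
hypothetical names for illustration, not Arrow APIs. The SoA variant touches one
buffer per field for every row, while the packed variant reads a single
contiguous run of k values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical hash combiner, for illustration only (not Arrow's hashing).
inline std::uint64_t HashCombine(std::uint64_t seed, std::uint64_t v) {
  return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// SoA: one buffer per field.
using Columns = std::vector<std::vector<std::int64_t>>;

// Hashing one row reads from k separate buffers: k indirections and,
// in the worst case, k distinct cache lines.
inline std::uint64_t HashRowSoA(const Columns& columns, std::size_t row) {
  std::uint64_t h = 0;
  for (const auto& column : columns) {
    assert(row < column.size());
    h = HashCombine(h, static_cast<std::uint64_t>(column[row]));
  }
  return h;
}

// AoS: rows are packed row-major, k values per row, in a single buffer.
// Hashing one row reads k adjacent values through one pointer.
inline std::uint64_t HashRowAoS(const std::vector<std::int64_t>& packed,
                                std::size_t k, std::size_t row) {
  assert((row + 1) * k <= packed.size());
  const std::int64_t* values = packed.data() + row * k;
  std::uint64_t h = 0;
  for (std::size_t f = 0; f < k; ++f) {
    h = HashCombine(h, static_cast<std::uint64_t>(values[f]));
  }
  return h;
}

// Converts the columnar layout into the packed row-major layout.
// Assumes all columns have the same length.
inline std::vector<std::int64_t> PackRows(const Columns& columns) {
  std::vector<std::int64_t> packed;
  if (columns.empty()) return packed;
  const std::size_t num_rows = columns[0].size();
  packed.reserve(num_rows * columns.size());
  for (std::size_t row = 0; row < num_rows; ++row) {
    for (const auto& column : columns) {
      assert(column.size() == num_rows);
      packed.push_back(column[row]);
    }
  }
  return packed;
}
```

Both functions produce the same hash for the same row; only the memory access
pattern differs, which is the part that matters when probing a hash table.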
The second thing is that when we work with intermediate results made of
`Scalar`s, the types will almost always be homogeneous. For example, when
computing the hash table of a join/group-by, you'll have something like
`hash<Scalar, Result>` where the type of each scalar instance is the same
(minus Null, but we can and should specialize for nullability). Thus adding the
type shared_ptr _and_ an `is_valid` boolean to every scalar is somewhat costly
(16 + 1 + sizeof(primitive_type) bytes).
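As a rough illustration of that per-scalar overhead, here is a sketch assuming
an int32 primitive. `DataType`, `TaggedInt32Scalar` and `Int32ScalarBatch` are
hypothetical stand-ins, not Arrow classes, and the exact sizes depend on the
platform and on struct padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a type descriptor; not Arrow's DataType class.
struct DataType {
  int id;
};

// Self-describing scalar: every instance carries its own type pointer
// and validity flag next to the value.
struct TaggedInt32Scalar {
  std::shared_ptr<const DataType> type;
  bool is_valid;
  std::int32_t value;
};

// Homogeneous collection: the type is stored once for all values.
// This is the non-null specialization, so there is no per-value flag.
struct Int32ScalarBatch {
  std::shared_ptr<const DataType> type;
  std::vector<std::int32_t> values;
};

inline Int32ScalarBatch MakeBatch(std::shared_ptr<const DataType> type,
                                  std::vector<std::int32_t> values) {
  assert(type != nullptr);
  return Int32ScalarBatch{std::move(type), std::move(values)};
}

// Bytes used by n self-describing scalars.
constexpr std::size_t TaggedBytes(std::size_t n) {
  return n * sizeof(TaggedInt32Scalar);
}

// Bytes used by one shared type pointer plus n raw values
// (ignores the vector's own bookkeeping).
constexpr std::size_t BatchPayloadBytes(std::size_t n) {
  return sizeof(std::shared_ptr<const DataType>) + n * sizeof(std::int32_t);
}

// The tagged scalar is at least as large as the sum of its members;
// padding usually makes it larger.
static_assert(sizeof(TaggedInt32Scalar) >=
                  sizeof(std::shared_ptr<const DataType>) + sizeof(bool) +
                      sizeof(std::int32_t),
              "tagged scalar carries the type pointer and validity flag");
```

Comparing `TaggedBytes(n)` with `BatchPayloadBytes(n)` for a large `n` shows how
much of the footprint goes to repeated type pointers and flags rather than to
the values themselves.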
This optimization can be hidden in the implementation, but I wonder if we'll
have to expose these homogeneous collections at the API boundaries.
> [C++] Consider adding a scalar type object model
> ------------------------------------------------
>
> Key: ARROW-47
> URL: https://issues.apache.org/jira/browse/ARROW-47
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Wes McKinney
> Assignee: Uwe L. Korn
> Priority: Major
> Labels: Analytics
> Fix For: 0.13.0
>
>
> Just did this on the Python side. In later analytics routines, passing in
> scalar values (example: Array + Scalar) requires some kind of container. Some
> systems, like the R language, solve this problem with length-1 arrays, but we
> should do some analysis of use cases and figure out what will work best for
> Arrow.