[jira] [Commented] (ARROW-16289) [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE encoded arrays

Eduardo Ponce (Jira) Fri, 22 Apr 2022 12:54:07 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-16289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526656#comment-17526656
 ]


Eduardo Ponce commented on ARROW-16289:
---------------------------------------

The term Scalar is used in several different (but related) contexts: for 
example, Scalar values, Scalar kernels, and Scalar expressions.

I recall an ad-hoc conversation last year in which it was discussed that we 
should consider treating Scalars as 1-element Arrays to make the compute 
layer logic more straightforward. The front-end API would still have the 
concept of a Scalar, but it would be disguised as an Array for execution 
purposes.
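
To make the idea concrete, here is a minimal sketch of a Scalar being 
disguised as a 1-element Array before it reaches a kernel. The types 
Int64Array and Int64Scalar and the functions ToExecArray and Add are 
hypothetical stand-ins written for this illustration; they are not Arrow's 
actual classes or kernel signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration only, not Arrow's real classes.
struct Int64Array {
  std::vector<int64_t> values;
  std::size_t length() const { return values.size(); }
};

struct Int64Scalar {
  int64_t value;
};

// The front-end keeps the Scalar concept; for execution the value is
// wrapped as a 1-element array.
inline Int64Array ToExecArray(const Int64Scalar& s) {
  return Int64Array{{s.value}};
}

// A single array-only "add" kernel. A 1-element input is broadcast against
// the other input, so no separate scalar code path is required.
inline Int64Array Add(const Int64Array& lhs, const Int64Array& rhs) {
  const std::size_t n =
      lhs.length() > rhs.length() ? lhs.length() : rhs.length();
  // Each input must either match the output length or be broadcastable.
  assert(lhs.length() == n || lhs.length() == 1);
  assert(rhs.length() == n || rhs.length() == 1);

  Int64Array out;
  out.values.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    const int64_t a = lhs.values[lhs.length() == 1 ? 0 : i];
    const int64_t b = rhs.values[rhs.length() == 1 ? 0 : i];
    out.values.push_back(a + b);
  }
  return out;
}
```

With this shape, a caller would write Add(column, ToExecArray(Int64Scalar{10})) 
and the kernel never needs to know that one of its inputs started life as a 
Scalar.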

I think such a proposal has its merits, but we should identify where the 
concept of Scalar will remain and make these distinctions clear.

> [C++] (eventually) abandon scalar columns of an ExecBatch in favor of RLE 
> encoded arrays
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-16289
>                 URL: https://issues.apache.org/jira/browse/ARROW-16289
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> This JIRA is a proposal / discussion.  I am not asserting this is the way to 
> go but I would like to consider it.
> From the execution engine's perspective an exec batch's columns are always 
> either arrays or scalars.  The only time we make use of scalars today is for 
> the four augmented columns (e.g. __filename).  Once we have support for RLE 
> arrays a scalar could easily be encoded as an RLE array and there would be no 
> need to use scalars here.
> The advantage would be reducing the complexity in exec nodes and avoiding 
> issues like ARROW-16288.  It is already rather difficult to explain the idea 
> of a "scalar" and "vector" function and then have to turn around and explain 
> that the word "scalar" has an entirely different meaning when talking about 
> field shape.
> I think it's worth considering taking this even further and removing the 
> concept from the compute layer entirely.  Kernel functions that want to have 
> special logic for scalars could do so using the RLE array.  This would be a 
> significant change to many kernels which currently declare the ANY shape and 
> determine which logic to apply within the kernel itself (e.g. there is one 
> array OR scalar kernel and not one kernel for each).
> Admittedly there are probably a few more instructions and a few more bytes 
> needed to handle an RLE scalar than the scalar we have today.  However, 
> these are just different flavors of O(1) and are not likely to have a 
> significant impact.
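
The quoted idea of encoding a scalar as an RLE array can be sketched as 
follows. This is only an illustration under assumptions: RleInt64Array, 
EncodeScalar, Decode, and Sum are hypothetical names, and the layout (one 
vector of run lengths, one vector of run values) is a guess at what an RLE 
array might look like, not Arrow's actual or planned format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical run-length-encoded column; a real layout may differ.
struct RleInt64Array {
  std::vector<int64_t> run_lengths;  // number of rows covered by each run
  std::vector<int64_t> run_values;   // value repeated within each run

  // Logical number of rows, i.e. the sum of all run lengths.
  int64_t length() const {
    int64_t n = 0;
    for (int64_t len : run_lengths) n += len;
    return n;
  }

  // A "scalar" column is simply an array made of exactly one run.
  bool is_constant() const { return run_values.size() == 1; }
};

// Encode a scalar as a single run spanning the whole batch.
inline RleInt64Array EncodeScalar(int64_t value, int64_t num_rows) {
  assert(num_rows >= 0);
  return RleInt64Array{{num_rows}, {value}};
}

// Expand to a plain vector, for kernels without any RLE-aware logic.
inline std::vector<int64_t> Decode(const RleInt64Array& a) {
  std::vector<int64_t> out;
  out.reserve(static_cast<std::size_t>(a.length()));
  for (std::size_t r = 0; r < a.run_values.size(); ++r) {
    out.insert(out.end(), static_cast<std::size_t>(a.run_lengths[r]),
               a.run_values[r]);
  }
  return out;
}

// An RLE-aware kernel does work proportional to the number of runs, so a
// single-run (scalar) input costs O(1) regardless of the batch length.
inline int64_t Sum(const RleInt64Array& a) {
  int64_t total = 0;
  for (std::size_t r = 0; r < a.run_values.size(); ++r) {
    total += a.run_lengths[r] * a.run_values[r];
  }
  return total;
}
```

In this sketch the extra cost over a plain scalar is one run length and a 
loop that executes once, which is the "few instructions and a few bytes" 
overhead described in the quoted text. A kernel that wants special scalar 
logic could branch on is_constant().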



