Since ARROW-12739 is a binary (dyadic) elementwise function, taking
(string, string) -> list<string>, it makes sense to implement it as a
compute function / ScalarKernel.
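
To make "elementwise" concrete, here is a rough sketch in plain C++ with
no Arrow dependency. It assumes the function pairs the two inputs row by
row, which is my reading of the (string, string) -> list<string> shape.
StringColumn, ListColumn and PairRows are hypothetical names, not Arrow
APIs, and nulls and memory pools are left out.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for a string array and a list<string> array.
// Real Arrow arrays also carry validity bitmaps and use a memory pool.
using StringColumn = std::vector<std::string>;
using ListColumn = std::vector<std::vector<std::string>>;

// Elementwise binary function: output row i depends only on row i of
// each input, which is the property that suits a scalar kernel.
ListColumn PairRows(const StringColumn& left, const StringColumn& right) {
  if (left.size() != right.size()) {
    throw std::invalid_argument("inputs must have the same length");
  }
  ListColumn out;
  out.reserve(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    out.push_back({left[i], right[i]});
  }
  return out;
}
```

Because each output row is computed independently of the others, the
same loop body could run over any chunk of the inputs.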

I agree that some of our existing utility functions could be
reframed as compute functions. Speaking of which, we might consider
promoting the FunctionRegistry and the other base machinery of Arrow
array functions to "always-on" status (rather than toggled with
ARROW_COMPUTE=on) and instead let ARROW_COMPUTE=on toggle the
compilation of a compendium of optional kernels.
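
As a rough illustration of that split, here is a toy sketch in plain
C++. It is not Arrow's FunctionRegistry API: ToyRegistry, MakeRegistry,
the "negate" and "square" kernels and the TOY_OPTIONAL_KERNELS macro are
all hypothetical, and keeping one kernel in the always-compiled part is
my assumption. The registry is always built, while the optional kernels
are compiled only when the flag is defined.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy types; not Arrow's actual compute machinery.
using Column = std::vector<double>;
using Kernel = std::function<Column(const Column&)>;

// The registry itself is always compiled ("always-on").
class ToyRegistry {
 public:
  void Add(const std::string& name, Kernel kernel) {
    kernels_[name] = std::move(kernel);
  }
  bool Has(const std::string& name) const { return kernels_.count(name) > 0; }
  Column Call(const std::string& name, const Column& input) const {
    auto it = kernels_.find(name);
    if (it == kernels_.end()) {
      throw std::out_of_range("no function named " + name);
    }
    return it->second(input);
  }

 private:
  std::map<std::string, Kernel> kernels_;
};

// Assumption: a small set of kernels ships unconditionally.
void RegisterCoreKernels(ToyRegistry& registry) {
  registry.Add("negate", [](const Column& in) {
    Column out;
    out.reserve(in.size());
    for (double v : in) out.push_back(-v);
    return out;
  });
}

// Optional kernels, compiled only when the (hypothetical) build flag
// TOY_OPTIONAL_KERNELS is defined.
#ifdef TOY_OPTIONAL_KERNELS
void RegisterOptionalKernels(ToyRegistry& registry) {
  registry.Add("square", [](const Column& in) {
    Column out;
    out.reserve(in.size());
    for (double v : in) out.push_back(v * v);
    return out;
  });
}
#endif

ToyRegistry MakeRegistry() {
  ToyRegistry registry;
  RegisterCoreKernels(registry);
#ifdef TOY_OPTIONAL_KERNELS
  RegisterOptionalKernels(registry);
#endif
  return registry;
}
```

Callers look functions up by name either way, so code that only needs
the always-compiled part does not change when the optional kernels are
switched off.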

In the fullness of time, we might want to establish more granularity /
a hierarchy of compute functions so that a user of this system isn't
left with an "all or nothing" decision about including all the
compiled kernels in their project even if they only need a couple of
functions.

On Tue, May 11, 2021 at 3:50 PM Eduardo Ponce <[email protected]> wrote:
>
> This is a very good question.
> I agree with @Antoine and would like to add that the focus of compute
> functions is to have a public API
> while utility functions are for internal use.
>
> Operations similar to ARROW-12739 are the structural transformations [1], such
> as "list_flatten" [2],
> which makes use of a memory pool. Based on this, I would consider it a
> compute kernel, as a query engine
> can benefit from it. To be more precise, compute functions are defined as
> "analytical functions that process
> primarily columnar data for either scalar or array inputs. These are
> intended for use inside query engines,
> data frames, etc."
>
> Nevertheless, there are utility functions which make use of memory pools
> (e.g., bitmap operations),
> so I do not think that the use of a memory pool should dictate the choice
> between utility and compute functions.
>
> ~Eduardo
>
> [1] https://arrow.apache.org/docs/cpp/compute.html#id2
> [2] https://github.com/edponce/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_nested.cc
>
> On Tue, May 11, 2021 at 4:13 PM Antoine Pitrou <[email protected]> wrote:
>
> >
> > On 11/05/2021 at 22:10, Weston Pace wrote:
> > > How does one decide between "utility function" and "compute function"?
> > >    For example, https://issues.apache.org/jira/browse/ARROW-12739 is
> > > very similar to StructArray::Make which is implemented as a static
> > > function.  However, 12739 would require pool allocation (to
> > > concatenate the list items into one large contiguous array) and array
> > > iteration (to copy into the allocated array).  Does that make it a
> > > compute function?
> >
> > If it's useful internally as a building block, then IMHO it should
> > probably be a utility function.
> >
> > In this case it is a user request, and it has a non-trivial computation
> > cost, so I'd say it should be a compute function.
> >
> > Regards
> >
> > Antoine.
> >
