1. This is great. I will follow that JIRA (better yet, I'll see if I can make the contribution myself).
2. If we forget about the multithreading case for a moment, this requirement came up while implementing a "groupby + aggregation" operation (single-threaded). Let's assume the table is not sorted. The simplest approach, then, is to keep the intermediate state in a container (map/vector) and update it while traversing the table. This approach becomes important when there is a large number of groups and there aren't enough rows with the same key to use 'consume' vector aggregation (on a sorted table).

On Tue, Nov 17, 2020 at 10:54 AM Benjamin Kietzman <bengil...@gmail.com> wrote:

> Hi Niranda,
>
> hastebin: That looks generally correct, though I should warn you that a
> recent PR (https://github.com/apache/arrow/pull/8574) changed the return
> type of DispatchExact to Kernel, so you'll need to insert an explicit
> cast to ScalarAggregateKernel.
>
> 1: This seems like a feature which might be generally useful to
> consumers of the compute module, so it's probably worth adding to the
> KernelState interface in some way rather than exposing individual
> implementations. I've opened
> https://issues.apache.org/jira/browse/ARROW-10630 to track this feature.
>
> 2: I would not expect your container to contain a large number of
> KernelStates. Specifically: why would you need more than one per local
> thread? Additionally, for the specific case of aggregation I'd expect
> that KernelStates not actively in use would be `merge`d and no longer
> stored. With small numbers of instances I would expect the memory
> overhead due to polymorphism to be negligible.
>
> On Mon, Nov 16, 2020 at 7:03 PM Niranda Perera <niranda.per...@gmail.com>
> wrote:
>
>> Hi Ben and Wes,
>> Based on our discussion, I did the following:
>> https://hastebin.com/ajadonados.cpp
>>
>> It seems to be working fine. Would love to get your feedback on this!
>> :-)
>>
>> But I have a couple of concerns.
>> 1.
>> Say I want to communicate the intermediate state data across multiple
>> processes. Unfortunately, the KernelState struct does not expose the
>> data pointer to the outside. If, say, SumState were exposed, we could
>> have accessed that data directly. WDYT?
>> 2. Polymorphism and virtual functions - Intuitively, a mean
>> aggregation's intermediate state would be a {T, int64} tuple. But I
>> believe the size of struct "SumImpl : public ScalarAggregator
>> (: public KernelState)" would be
>> sizeof(T) + sizeof(int64) + sizeof(ScalarAggregator) + sizeof(vptr),
>> wouldn't it? So, if I am managing a compute state container, my memory
>> requirement would be higher than simply using a {T, int64} tuple.
>> Please correct me if I am wrong. I am not sure if there is a better
>> solution to this, but I just want to discuss it with you.
>>
>> On Tue, Nov 10, 2020 at 9:44 AM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>>
>>> Yes, open a Jira and propose a PR implementing the changes you need.
>>>
>>> On Mon, Nov 9, 2020 at 8:31 PM Niranda Perera
>>> <niranda.per...@gmail.com> wrote:
>>> >
>>> > @wes How should I proceed with this nevertheless? Should I open a
>>> > JIRA?
>>> >
>>> > On Mon, Nov 9, 2020 at 11:09 AM Wes McKinney <wesmck...@gmail.com>
>>> > wrote:
>>> >
>>> > > On Mon, Nov 9, 2020 at 9:32 AM Niranda Perera
>>> > > <niranda.per...@gmail.com> wrote:
>>> > > >
>>> > > > @Ben
>>> > > > Thank you very much for the feedback. Unfortunately, I was
>>> > > > unable to find a header that exposes a SumAggregateKernel in
>>> > > > v2.0.0. Maybe I am checking it wrong; I remember accessing them
>>> > > > in v0.16, IINM.
>>> > > >
>>> > > > @Wes
>>> > > > Yes, that would be great. How about adding a CMake compilation
>>> > > > flag for such dev use cases?
>>> > >
>>> > > This seems like it could cause more problems -- I think it would
>>> > > be sufficient to use an "internal::" C++ namespace and always
>>> > > install the relevant header file.
>>> > >
>>> > > > On Sun, Nov 8, 2020 at 9:14 PM Wes McKinney
>>> > > > <wesmck...@gmail.com> wrote:
>>> > > >
>>> > > > > I'm not opposed to installing headers that provide access to
>>> > > > > some of the kernel implementation internals (with the caveat
>>> > > > > that changes won't go through a deprecation cycle, so caveat
>>> > > > > emptor). It might be more sustainable to think about what
>>> > > > > kind of stable-ish public API could be exported to support
>>> > > > > applications like Cylon.
>>> > > > >
>>> > > > > On Sun, Nov 8, 2020 at 10:37 AM Ben Kietzman
>>> > > > > <b...@ursacomputing.com> wrote:
>>> > > > > >
>>> > > > > > Hi Niranda,
>>> > > > > >
>>> > > > > > SumImpl is a subclass of KernelState. Given a
>>> > > > > > SumAggregateKernel, one can produce zeroed KernelState
>>> > > > > > using the `init` member, then operate on data using the
>>> > > > > > `consume`, `merge`, and `finalize` members. You can look at
>>> > > > > > ScalarAggExecutor for an example of how to get from a
>>> > > > > > compute function to kernels and kernel state. Will that
>>> > > > > > work for you?
>>> > > > > >
>>> > > > > > Ben Kietzman
>>> > > > > >
>>> > > > > > On Sun, Nov 8, 2020, 11:21 Niranda Perera
>>> > > > > > <niranda.per...@gmail.com> wrote:
>>> > > > > >
>>> > > > > > > Hi Ben,
>>> > > > > > >
>>> > > > > > > We are building a distributed table abstraction on top of
>>> > > > > > > Arrow dataframes called Cylon
>>> > > > > > > (https://github.com/cylondata/cylon). Currently we have a
>>> > > > > > > simple aggregation and group-by operation implementation.
>>> > > > > > > But we felt we could provide more functionality if we
>>> > > > > > > could import Arrow kernels and states into the
>>> > > > > > > corresponding Cylon distributed kernels.
>>> > > > > > > Ex: For a distributed mean, we would communicate the
>>> > > > > > > local Arrow SumState, then do a SumImpl::MergeFrom(), and
>>> > > > > > > then call Finalize.
>>> > > > > > > Is there any other way to access these intermediate
>>> > > > > > > states from compute operations?
>>> > > > > > >
>>> > > > > > > On Sun, Nov 8, 2020 at 11:11 AM Ben Kietzman
>>> > > > > > > <b...@ursacomputing.com> wrote:
>>> > > > > > >
>>> > > > > > > > Hi Niranda,
>>> > > > > > > >
>>> > > > > > > > What is the context of your work? If you're working
>>> > > > > > > > inside the arrow repository you shouldn't need to
>>> > > > > > > > install headers before using them, and we welcome PRs
>>> > > > > > > > for new kernels. Otherwise, could you provide some
>>> > > > > > > > details about how your work is using Arrow as a
>>> > > > > > > > dependency?
>>> > > > > > > >
>>> > > > > > > > Ben Kietzman
>>> > > > > > > >
>>> > > > > > > > On Sun, Nov 8, 2020, 10:57 Niranda Perera
>>> > > > > > > > <niranda.per...@gmail.com> wrote:
>>> > > > > > > >
>>> > > > > > > > > Hi,
>>> > > > > > > > >
>>> > > > > > > > > I was wondering if I could use the
>>> > > > > > > > > arrow/compute/kernels/*internal.h headers in my work?
>>> > > > > > > > > I would like to reuse some of the kernel
>>> > > > > > > > > implementations and kernel states.
>>> > > > > > > > >
>>> > > > > > > > > With -DARROW_COMPUTE=ON, those headers are not added
>>> > > > > > > > > to the include dir. I see that the *internal.h
>>> > > > > > > > > headers are unfortunately skipped by the
>>> > > > > > > > > ARROW_INSTALL_ALL_HEADERS cmake function.
>>> > > > > > > > >
>>> > > > > > > > > Best
>>> > > > > > > > > --
>>> > > > > > > > > Niranda Perera
>>> > > > > > > > > @n1r44 <https://twitter.com/N1R44>
>>> > > > > > > > > +1 812 558 8884 / +94 71 554 8430
>>> > > > > > > > > https://www.linkedin.com/in/niranda

--
Niranda Perera
@n1r44 <https://twitter.com/N1R44>
+1 812 558 8884 / +94 71 554 8430
https://www.linkedin.com/in/niranda
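[Editor's illustration] The consume/merge/finalize lifecycle discussed in the thread, and the distributed-mean pattern Niranda describes (communicate a local state, merge it, then finalize), can be sketched without any Arrow types. The `MeanState` below is a hypothetical stand-in, not Arrow's SumImpl/KernelState; it only shows the shape of the pattern and of the {T, int64} intermediate state mentioned in concern 2.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical aggregation state: a plain {sum, count} pair, mirroring the
// `consume`/`merge`/`finalize` members of Arrow's aggregate kernels without
// the polymorphic KernelState base.
struct MeanState {
  double sum = 0.0;   // running sum of consumed values
  int64_t count = 0;  // number of values consumed

  // `consume`: fold a local chunk of data into this state.
  void Consume(const std::vector<double>& chunk) {
    for (double v : chunk) {
      sum += v;
      ++count;
    }
  }

  // `merge`: absorb another partial state, e.g. one received from a remote
  // process in a distributed mean.
  void MergeFrom(const MeanState& other) {
    sum += other.sum;
    count += other.count;
  }

  // `finalize`: produce the final aggregate from the merged state.
  double Finalize() const { return count == 0 ? 0.0 : sum / count; }
};
```

Because the state is a trivially copyable pair, it could be sent across processes as raw bytes before `MergeFrom`, which is exactly the access that the opaque KernelState interface currently makes awkward.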
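[Editor's illustration] The single-threaded "groupby + aggregation" approach from point 2 - one small intermediate state per key, held in a hash map and updated while traversing an unsorted table - can be sketched as follows. `GroupState` and `GroupedMean` are hypothetical names for illustration; no Arrow APIs are used.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-group intermediate state for a mean: just {sum, count}.
struct GroupState {
  double sum = 0.0;
  int64_t count = 0;
};

// Single-threaded group-by + mean over an unsorted (key, value) table:
// keep one state per distinct key and update it row by row.
std::unordered_map<std::string, double> GroupedMean(
    const std::vector<std::pair<std::string, double>>& table) {
  std::unordered_map<std::string, GroupState> states;
  for (const auto& [key, value] : table) {
    GroupState& s = states[key];  // zero-initialized on first occurrence
    s.sum += value;
    ++s.count;
  }
  // "Finalize" every group once the whole table has been traversed.
  std::unordered_map<std::string, double> means;
  for (const auto& [key, s] : states) {
    means[key] = s.sum / s.count;
  }
  return means;
}
```

With many groups, the map holds many such states at once, which is why the per-state memory overhead of a polymorphic KernelState (vptr plus base subobjects) matters more here than in the one-state-per-thread case Ben describes.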