Re: [C++] Revamping approach to Arrow compute kernel development

Wes McKinney Sun, 24 May 2020 07:37:48 -0700

I have merged the patch but left the PR open for additional code review.


On Sat, May 23, 2020 at 3:24 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> To be clear given the scope of code affected I think we should merge it today 
> and address further feedback in a follow up patch. I will be diligent about 
> responding to additional comments in the PR
>
> On Sat, May 23, 2020, 3:19 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> Yes you should still be able to comment. I will reopen the PR after it is 
>> merged
>>
>> On Sat, May 23, 2020, 2:52 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>>>
>>> Hi Wes,
>>> Will we still be able to comment on the PR once it is closed?
>>>
>>>
>>> If we want to be inclusive on feedback it might pay to wait until Tuesday 
>>> evening US time to merge since it is a long weekend here.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Saturday, May 23, 2020, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>
>>>> Hi folks -- I've addressed a good deal of feedback and added a lot of
>>>> comments and with Kou's help have got the build passing, It would be
>>>> great if this could be merged soon to unblock follow up PRs
>>>>
>>>> On Wed, May 20, 2020 at 11:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>>> >
>>>> > I just opened the PR https://github.com/apache/arrow/pull/7240
>>>> >
>>>> > I'm sorry it's so big. I really think this is the best way. The only
>>>> > further work I plan to do on it is to get the CI passing.
>>>> >
>>>> > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <wesmck...@gmail.com> 
>>>> > wrote:
>>>> > >
>>>> > > I'd guess I'm < 24 hours away from putting up my initial PR for this
>>>> > > work. Since the work is large and (for all practical purposes) nearly
>>>> > > impossible to separate into separately merge-ready PRs, I'll start a
>>>> > > new e-mail thread describing what I've done in more detail and
>>>> > > proposing a path for merging the PR (so as to unblock PRs into
>>>> > > arrow/compute and avoid rebasing headaches) and addressing review
>>>> > > feedback that will likely take several weeks to fully accumulate.
>>>> > >
>>>> > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <wesmck...@gmail.com> 
>>>> > > wrote:
>>>> > > >
>>>> > > > I'm working actively on this but perhaps as expected it has ballooned
>>>> > > > into a very large project -- it's unclear at the moment whether I'll
>>>> > > > be able to break the work into smaller patches that are easier to
>>>> > > > digest. I'm working as fast as I can to have an initial
>>>> > > > feature-preserving PR up, but the changes to the src/arrow/compute
>>>> > > > directory are extensive, so I would please ask that we do not merge
>>>> > > > non-essential patches into cpp/src/arrow/compute until I get the PR 
>>>> > > > up
>>>> > > > (estimated later this week at present rate) so we can see where 
>>>> > > > things
>>>> > > > stand.
>>>> > > >
>>>> > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <wesmck...@gmail.com> 
>>>> > > > wrote:
>>>> > > > >
>>>> > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield 
>>>> > > > > <emkornfi...@gmail.com> wrote:
>>>> > > > > >
>>>> > > > > > Hi Wes,
>>>> > > > > > I haven't had time to read the doc, but wanted to ask some 
>>>> > > > > > questions on
>>>> > > > > > points raised on the thread.
>>>> > > > > >
>>>> > > > > > * For efficiency, kernels used for array-expr evaluation should 
>>>> > > > > > write
>>>> > > > > > > into preallocated memory as their default mode. This enables 
>>>> > > > > > > the
>>>> > > > > > > interpreter to avoid temporary memory allocations and improve 
>>>> > > > > > > CPU
>>>> > > > > > > cache utilization. Almost none of our kernels are implemented 
>>>> > > > > > > this way
>>>> > > > > > > currently.
>>>> > > > > >
>>>> > > > > > Did something change, I was pretty sure I submitted a patch a 
>>>> > > > > > while ago for
>>>> > > > > > boolean kernels, that separated out memory allocation from 
>>>> > > > > > computation.
>>>> > > > > > Which should allow for writing to the same memory.  Is this a 
>>>> > > > > > concern with
>>>> > > > > > the public Function APIs for the Kernel APIs themselves, or a 
>>>> > > > > > lower level
>>>> > > > > > implementation concern?
>>>> > > > >
>>>> > > > > Yes, you did in the internal implementation [1]. The concern is the
>>>> > > > > public API and the general approach to implementing new kernels.
>>>> > > > >
>>>> > > > > I'm working on this right now (it's a large project so it will 
>>>> > > > > take me
>>>> > > > > a little while to produce something to be reviewed) so bear with 
>>>> > > > > me =)
>>>> > > > >
>>>> > > > > [1]: 
>>>> > > > > https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76
>>>> > > > >
>>>> > > > > > * Sorting is generally handled by different data processing 
>>>> > > > > > nodes from
>>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and 
>>>> > > > > > > Joins.
>>>> > > > > > > Projections and Filters use expressions, they do not sort.
>>>> > > > > >
>>>> > > > > > Would sorting the list-column elements per row be an array-expr?
>>>> > > > >
>>>> > > > > Yes, as that's an element-wise function. When I said sorting I was
>>>> > > > > referring to ORDER BY. The functions we have that do sorting do so 
>>>> > > > > in
>>>> > > > > the context of a single array [2].
>>>> > > > >
>>>> > > > > A query engine must be able to sort a (potentially very large) 
>>>> > > > > stream
>>>> > > > > of record batches. One approach is for the Sort operator to exhaust
>>>> > > > > its child input, accumulating all of the record batches in memory
>>>> > > > > (spilling to disk as needed) and then sorting and emitting record
>>>> > > > > batches from the sorted records/tuples. See e.g. Impala's sorting 
>>>> > > > > code
>>>> > > > > [3] [4]
>>>> > > > >
>>>> > > > > [2]: 
>>>> > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34
>>>> > > > > [3]: 
>>>> > > > > https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h
>>>> > > > > [4]: 
>>>> > > > > https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h
>>>> > > > >
>>>> > > > > >
>>>> > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney 
>>>> > > > > > <wesmck...@gmail.com> wrote:
>>>> > > > > >
>>>> > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou 
>>>> > > > > > > <anto...@python.org> wrote:
>>>> > > > > > > >
>>>> > > > > > > >
>>>> > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit :
>>>> > > > > > > > >>
>>>> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be a 
>>>> > > > > > > > >> problem,
>>>> > > > > > > since
>>>> > > > > > > > >> only the second pass writes to the output.
>>>> > > > > > > > >
>>>> > > > > > > > > This kernel is not valid for normal array-exprs (see the 
>>>> > > > > > > > > spreadsheet I
>>>> > > > > > > > > linked), such as what you can write in SQL
>>>> > > > > > > > >
>>>> > > > > > > > > Kernels like SortToIndices are a different type of 
>>>> > > > > > > > > function (in other
>>>> > > > > > > > > words, "not a SQL function") and so if we choose to allow 
>>>> > > > > > > > > such a
>>>> > > > > > > > > "non-SQL-like" functions in the expression evaluator then 
>>>> > > > > > > > > different
>>>> > > > > > > > > logic must be used.
>>>> > > > > > > >
>>>> > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level 
>>>> > > > > > > > we're
>>>> > > > > > > > talking here.  SortToIndices() may not be a "SQL function", 
>>>> > > > > > > > but it looks
>>>> > > > > > > > like an important basic block for a query engine (since, 
>>>> > > > > > > > after all,
>>>> > > > > > > > sorting results is an often used feature in SQL and other 
>>>> > > > > > > > languages).
>>>> > > > > > > > So it should be usable *inside* the expression engine, even 
>>>> > > > > > > > though it's
>>>> > > > > > > > not part of the exposed vocabulary, no?
>>>> > > > > > >
>>>> > > > > > > No, not as part of "expressions" as they are defined in the 
>>>> > > > > > > context of
>>>> > > > > > > SQL engines.
>>>> > > > > > >
>>>> > > > > > > Sorting is generally handled by different data processing 
>>>> > > > > > > nodes from
>>>> > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and 
>>>> > > > > > > Joins.
>>>> > > > > > > Projections and Filters use expressions, they do not sort.
>>>> > > > > > >
>>>> > > > > > > > Regards
>>>> > > > > > > >
>>>> > > > > > > > Antoine.
>>>> > > > > > >

Re: [C++] Revamping approach to Arrow compute kernel development

Reply via email to