Hi folks -- I've addressed a good deal of feedback and added a lot of comments and with Kou's help have got the build passing, It would be great if this could be merged soon to unblock follow up PRs
On Wed, May 20, 2020 at 11:55 PM Wes McKinney <wesmck...@gmail.com> wrote: > > I just opened the PR https://github.com/apache/arrow/pull/7240 > > I'm sorry it's so big. I really think this is the best way. The only > further work I plan to do on it is to get the CI passing. > > On Wed, May 20, 2020 at 12:26 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > I'd guess I'm < 24 hours away from putting up my initial PR for this > > work. Since the work is large and (for all practical purposes) nearly > > impossible to separate into separately merge-ready PRs, I'll start a > > new e-mail thread describing what I've done in more detail and > > proposing a path for merging the PR (so as to unblock PRs into > > arrow/compute and avoid rebasing headaches) and addressing review > > feedback that will likely take several weeks to fully accumulate. > > > > On Mon, May 11, 2020 at 5:17 PM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > > I'm working actively on this but perhaps as expected it has ballooned > > > into a very large project -- it's unclear at the moment whether I'll > > > be able to break the work into smaller patches that are easier to > > > digest. I'm working as fast as I can to have an initial > > > feature-preserving PR up, but the changes to the src/arrow/compute > > > directory are extensive, so I would please ask that we do not merge > > > non-essential patches into cpp/src/arrow/compute until I get the PR up > > > (estimated later this week at present rate) so we can see where things > > > stand. > > > > > > On Wed, Apr 22, 2020 at 8:06 AM Wes McKinney <wesmck...@gmail.com> wrote: > > > > > > > > On Wed, Apr 22, 2020 at 12:41 AM Micah Kornfield > > > > <emkornfi...@gmail.com> wrote: > > > > > > > > > > Hi Wes, > > > > > I haven't had time to read the doc, but wanted to ask some questions > > > > > on > > > > > points raised on the thread. > > > > > > > > > > * For efficiency, kernels used for array-expr evaluation should write > > > > > > into preallocated memory as their default mode. This enables the > > > > > > interpreter to avoid temporary memory allocations and improve CPU > > > > > > cache utilization. Almost none of our kernels are implemented this > > > > > > way > > > > > > currently. > > > > > > > > > > Did something change, I was pretty sure I submitted a patch a while > > > > > ago for > > > > > boolean kernels, that separated out memory allocation from > > > > > computation. > > > > > Which should allow for writing to the same memory. Is this a concern > > > > > with > > > > > the public Function APIs for the Kernel APIs themselves, or a lower > > > > > level > > > > > implementation concern? > > > > > > > > Yes, you did in the internal implementation [1]. The concern is the > > > > public API and the general approach to implementing new kernels. > > > > > > > > I'm working on this right now (it's a large project so it will take me > > > > a little while to produce something to be reviewed) so bear with me =) > > > > > > > > [1]: > > > > https://github.com/apache/arrow/commit/4910fbf4fda05b864daaba820db08291e4afdcb6#diff-561ea05d36150eb15842f452e3f07c76 > > > > > > > > > * Sorting is generally handled by different data processing nodes from > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins. > > > > > > Projections and Filters use expressions, they do not sort. > > > > > > > > > > Would sorting the list-column elements per row be an array-expr? > > > > > > > > Yes, as that's an element-wise function. When I said sorting I was > > > > referring to ORDER BY. The functions we have that do sorting do so in > > > > the context of a single array [2]. > > > > > > > > A query engine must be able to sort a (potentially very large) stream > > > > of record batches. One approach is for the Sort operator to exhaust > > > > its child input, accumulating all of the record batches in memory > > > > (spilling to disk as needed) and then sorting and emitting record > > > > batches from the sorted records/tuples. See e.g. Impala's sorting code > > > > [3] [4] > > > > > > > > [2]: > > > > https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/sort_to_indices.h#L34 > > > > [3]: > > > > https://github.com/apache/impala/blob/master/be/src/runtime/sorter.h > > > > [4]: > > > > https://github.com/apache/impala/blob/master/be/src/exec/sort-node.h > > > > > > > > > > > > > > On Tue, Apr 21, 2020 at 5:35 AM Wes McKinney <wesmck...@gmail.com> > > > > > wrote: > > > > > > > > > > > On Tue, Apr 21, 2020 at 7:32 AM Antoine Pitrou <anto...@python.org> > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Le 21/04/2020 à 13:53, Wes McKinney a écrit : > > > > > > > >> > > > > > > > >> That said, in the SortToIndices case, this wouldn't be a > > > > > > > >> problem, > > > > > > since > > > > > > > >> only the second pass writes to the output. > > > > > > > > > > > > > > > > This kernel is not valid for normal array-exprs (see the > > > > > > > > spreadsheet I > > > > > > > > linked), such as what you can write in SQL > > > > > > > > > > > > > > > > Kernels like SortToIndices are a different type of function (in > > > > > > > > other > > > > > > > > words, "not a SQL function") and so if we choose to allow such a > > > > > > > > "non-SQL-like" functions in the expression evaluator then > > > > > > > > different > > > > > > > > logic must be used. > > > > > > > > > > > > > > Hmm, I think that maybe I'm misunderstanding at which level we're > > > > > > > talking here. SortToIndices() may not be a "SQL function", but > > > > > > > it looks > > > > > > > like an important basic block for a query engine (since, after > > > > > > > all, > > > > > > > sorting results is an often used feature in SQL and other > > > > > > > languages). > > > > > > > So it should be usable *inside* the expression engine, even > > > > > > > though it's > > > > > > > not part of the exposed vocabulary, no? > > > > > > > > > > > > No, not as part of "expressions" as they are defined in the context > > > > > > of > > > > > > SQL engines. > > > > > > > > > > > > Sorting is generally handled by different data processing nodes from > > > > > > Projections, Aggregations / Hash Aggregations, Filters, and Joins. > > > > > > Projections and Filters use expressions, they do not sort. > > > > > > > > > > > > > Regards > > > > > > > > > > > > > > Antoine. > > > > > >