Thanks Sasha — this is helpful. I'm going to take a college try at just the scalar kernels and see what I can accomplish over the next few days — will attempt to get a PR up for review with the C++ tests passing. I'm expecting assorted workarounds for the various kernels that do zero-copy optimizations (setting output buffers with input buffers — such optimizations should likely be carried out elsewhere), etc., but I will keep notes about challenges that I run into to assist with further refactoring and improvements.
On Fri, Jun 3, 2022 at 1:20 PM Sasha Krassovsky <krassovskysa...@gmail.com> wrote: > > Hi all, > I’ve been thinking about some sort of refactoring of this registry for a > while now, and I’ve written down some thoughts, please leave your comments. > > https://docs.google.com/document/d/1LAN9I_Y9cZaG2a84j1wLY8jSlK3gDXYMle-VtyFCAE8/edit?usp=sharing > > <https://docs.google.com/document/d/1LAN9I_Y9cZaG2a84j1wLY8jSlK3gDXYMle-VtyFCAE8/edit?usp=sharing> > > I agree with Weston that starting with a small set of kernels and drilling > down the implementation until we have a design with the characteristics we > want seems like a good idea. One point to consider is the tradeoff between > the size of our set of guinea pig kernels and how representative they are of > the rest. We will probably need to cover nested types as Weston said, > aggregations, etc. > > Sasha > > > On Jun 3, 2022, at 2:35 AM, Weston Pace <weston.p...@gmail.com> wrote: > > > > That approach looks great and very much in line with some of the stuff > > we have in light_array.h so I think it's very compatible. If you have > > the time to push this refactoring through then go for it. Don't let > > anything I'm saying deter any ongoing efforts. > > > > I'm just advocating that we be open to any kind of incremental > > approach. I looked a little closer at this today and you and Antoine > > are correct that "two function registries" is maybe not the right > > extension point. Instead it might be that we have two classes of > > scalar functions. For example, the "at most three buffers" ArraySpan > > that you've linked only works for types without children. We do have > > some scalar functions that run on nested types (e.g. count, index, > > etc.) My point is that rather than force ourselves to deal with all > > of these corner cases we should be open to an incremental solution. > > Maybe there is a ScalarPrimitive and a ScalarNested kernel type or > > Scalar and ScalarSpan. > > > > Also, if we know we are also going to want to tweak the output > > interface (I don't know for sure if we will) then maybe it makes sense > > to pick a small set of kernels and incrementally improve that small > > set until we think we've made all the changes we are going to want for > > the near future, and then promote those changes to the wider set of > > kernels. > > > > On Thu, Jun 2, 2022 at 8:07 AM Wes McKinney <wesmck...@gmail.com> wrote: > >> > >> On this topic, I actually have started prototyping a new ScalarKernel > >> exec interface that uses a non-owning, shared_ptr-free "ArraySpan" > >> data structure based on some prior conversations > >> > >> https://github.com/wesm/arrow/blob/711fd5e5665c280540bbaf48a48ca1eca1b91bff/cpp/src/arrow/compute/exec.h#L163 > >> https://github.com/wesm/arrow/blob/711fd5e5665c280540bbaf48a48ca1eca1b91bff/cpp/src/arrow/compute/kernel.h#L552 > >> > >> My idea was to refactor just the existing scalar kernels (which by my > >> estimation seems tractable since so many go through the common > >> templated kernel generators) and leave the other kinds of kernels > >> untouched so that they can be refactored in due course. I hadn't > >> planned to try to tackle changing the "output" interface from > >> shared_ptr<ArrayData> because there are some kernels that create new > >> ArrayData and some that don't. > >> > >> If this seems like a reasonable path, I can try to have a first draft > >> PR ready to go maybe by Monday (I was going to work on this over the > >> weekend when I can have some uninterrupted time to do the > >> refactoring). I'm not sure that a new registry is going to be needed > >> > >> On Thu, Jun 2, 2022 at 2:50 AM Antoine Pitrou <anto...@python.org> wrote: > >>> > >>> > >>> Le 02/06/2022 à 00:02, Weston Pace a écrit : > >>>> > >>>> I'd like to propose we add a second kernel function registry. There > >>>> doesn't need to be any user facing API change. We could probably use > >>>> an approach like [2] to proxy to the old function registry when the > >>>> newer registry doesn't contain the asked-for function. This would > >>>> allow us to focus on creating an efficient function registry without > >>>> having to worry about refactoring the existing kernels all at once. > >>> > >>> Why do we need a separate registry for this? >