Thank you for putting together this proposal. Very exciting development. I
left some comments in the RFC doc, summarized here as:

* Flatbuffers is usable as a serialization-agnostic IDL
(https://adsharma.github.io/flattools/)
* The serde library + msgpack is a worthy candidate to consider for
serialization, in conjunction with flatbuffers for the IDL and for large
arrays involving zero-copy.

I'm not super familiar with the type of problems this RFC or Ibis are
trying to solve, but wanted to draw attention to queries involving deeply
nested results and user defined functions in high level languages.

* I'm also the author of fquery (https://github.com/adsharma/fquery), which
has some similarities to Ibis but different internals.
** Supports nested/graph queries. If a flat table is a basic abstraction in
Ibis, a nested dict with async iterable values is the fundamental
abstraction in fquery.
** This abstraction can support multi-model databases/engines (graph,
document and relational)
** Much smaller core (~2k lines of Python), though with a smaller set of
operators. Adding new operators is simple.

On Fri, Aug 13, 2021 at 2:26 PM Phillip Cloud <cpcl...@gmail.com> wrote:

> Hey all,
>
> Just wanted to give an update on the effort here.
>
> Ben Kietzman has created an alternative proposal to the initial design [1].
> It largely overlaps with the original, but differs in a few important ways:
>
> * A big focus of the design is on flexibility, allowing producers,
> consumers and ultimately end users of those systems the ability to define
> custom operations in the graph.
> * There are very few predefined relational operations (project, filter,
> join and a handful of others).
> * There are only 3 types of value expressions: literals, field references,
> and function calls.
> * The model of evaluation is one that requires a final sink node, to
> indicate where the record batches from child relations end up (a file, a
> socket, an in-memory buffer, etc).
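For illustration, the three value-expression kinds and the sink-terminated
plan described above could be modeled like this (a rough sketch with
assumed names, not the actual flatbuffers schema from the PR):

```python
# Sketch only: assumed class and function names, not the real ComputeIR types.
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Literal:
    value: Any

@dataclass
class FieldRef:
    name: str

@dataclass
class Call:
    function: str
    args: List[Any]

@dataclass
class Sink:
    destination: str  # e.g. a file, socket, or in-memory buffer
    input: Any        # the child relation whose record batches land here

# e.g. a filter predicate `x > 3` feeding a sink:
expr = Call("greater", [FieldRef("x"), Literal(3)])
plan = Sink(destination="buffer", input=expr)
```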
>
> I've added notes [2] to the original Google doc (under the Alternative
> Design Notes subheading), and a few pseudocode examples.
> Unfortunately, these went out of date as soon as Ben pushed the PR [3], so
> I need to update those to reflect his changes. Regardless,
> the design is broadly the same, so it should still give a good indication
> of the details of the design.
>
> There are a decent number of review comments on the original PR that I plan
> to port over where they are still relevant.
> I also plan on adding support for window functions either tonight or on
> Monday.
>
> Please review this design at your earliest convenience. Since there's a
> fairly concrete set of types in flatbuffers that
> we can look at, ideally we can center discussion around the details in the
> PR.
>
> Thanks!
>
> [1]: https://github.com/apache/arrow/pull/10856
> [2]:
>
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> [3]: https://github.com/apache/arrow/pull/10934
>
> On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <jhyde.apa...@gmail.com>
> wrote:
>
> > > Wes wrote:
> > >
> > > Supporting this kind of intra-application engine
> > > heterogeneity is one of the motivations for the project.
> >
> > +1
> >
> > The data format is the natural interface between tasks. (Defining “task”
> > here as “something that is programmed using the IR”.) That is Arrow’s
> > strength.
> >
> > So I think the IR should describe what each task should do, and tasks
> > should be fairly small. Not whole relational operators, operating on
> whole
> > tables, but pieces of relational operators, operating on batches or
> > sequences of batches.
> >
> > Elsethread, someone mentioned the LoLePop concept and the
> Kohn/Leis/Neuman
> > paper [1]. The LoLePop concept sounds good for our purposes.
> >
> > Julian
> >
> > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> >
> >
> > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <cpcl...@gmail.com>
> > wrote:
> > >>
> > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > >> jorgecarlei...@gmail.com> wrote:
> > >>
> > >>> Couple of questions
> > >>>
> > >>> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data),
> the
> > >>> operation "(IR,data) - engine -> result" MUST be the same for all
> > "engine"?
> > >>>
> > >>
> > >> I think that might be a non-starter for mundane reasons: there are
> > >> probably at least two engines that disagree on the result of sum(x)
> > >> where x is a floating point column.
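As a standalone illustration of why (not tied to any particular engine):
floating-point addition is not associative, so two engines that merely sum
the same values in a different order can legitimately disagree.

```python
# Summing the same values in a different order gives different results.
xs = [1e20, 1.0, -1e20]
left_to_right = (xs[0] + xs[1]) + xs[2]  # 1e20 + 1.0 rounds to 1e20
reordered = (xs[0] + xs[2]) + xs[1]      # the big terms cancel first
print(left_to_right, reordered)  # 0.0 1.0
```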
> > >>
> > >>
> > >>> 2. if yes, imo we may need to worry about:
> > >>> * a definition of equality that implementations agree on.
> > >>> * agreement over what the semantics look like. For example, do we use
> > >>> kleene logic for AND and OR?
> > >>>
> > >>
> > >> WRT Kleene logic my thoughts are that the IR should support both
> Kleene
> > and
> > >> non-Kleene,
> > >> and producers can choose their desired semantics.
> > >>
> > >> Ibis, for example, would override the `&` operator in `a & b` to
> produce
> > >> `KleeneAnd(Column(a), Column(b))`.
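The difference between the two semantics, with None standing in for SQL
NULL (a sketch of standard three-valued logic, not any particular engine's
implementation):

```python
# Kleene AND can return False when one side is NULL, because
# False AND anything is False; the null-propagating variant returns
# NULL whenever either input is NULL.
def kleene_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return a and b

def null_propagating_and(a, b):
    if a is None or b is None:
        return None
    return a and b
```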
> > >>
> > >>
> > >>>
> > >>> To try to understand the gist, let's pick an aggregated count over a
> > >>> column: engines often do partial counts over partitions followed by a
> > final
> > >>> "sum" over the partial counts. Is the idea that the query engine
> would
> > >>> communicate with the compute engine via 2 IRs where one is "count me
> > these"
> > >>> the other is "sum me these"?
> > >>>
> > >>> Best,
> > >>> Jorge
> > >>>
> > >>
> > >> Not in its current incarnation.
> > >>
> > >> The idea is that the IR producer communicates a desire to count(x) to
> a
> > >> consumer, and it's up to the consumer to figure out how to turn that
> > count
> > >> into something that makes sense for itself. In your example that's a
> > series
> > >> of partial counts followed by a sum.
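That consumer-side decomposition might look like the following (a toy
sketch; a real engine would run the partial counts in parallel over its
partitions):

```python
# One count(x) from the producer becomes per-partition partial counts
# plus a final sum on the consumer side.
partitions = [[1, None, 3], [4, 5], [None]]

def partial_count(batch):
    # count(x) conventionally skips nulls
    return sum(1 for v in batch if v is not None)

partials = [partial_count(p) for p in partitions]
total = sum(partials)
print(partials, total)
```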
> > >>
> > >
> > > That said, I think there is a valid use case here where a system might
> > > make use of different engines to execute different (composable) layers
> > > of a complex query.
> > >
> > > For example:
> > >
> > > * suppose you want to scan and do predicate pushdown on an unusual
> > > data source that is only accessible from one particular engine but
> > > * you need to do some analytical operation with the scan results that
> > > is only supported by another engine
> > >
> > > You could decompose the query into two stages with an IR relational
> > > expression for each stage and use then the engines together to
> > > accomplish what you need (of course, you would need an orchestration
> > > layer to handle plumbing the query engine inputs and outputs together
> > > as Arrow streams). Supporting this kind of intra-application engine
> > > heterogeneity is one of the motivations for the project.
> > >
> > >>>
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cpcl...@gmail.com>
> > wrote:
> > >>>
> > >>>> Thanks Wes,
> > >>>>
> > >>>> Great to be back working on Arrow again and engaging with the
> > community.
> > >>> I
> > >>>> am really excited about this effort.
> > >>>>
> > >>>> I think there are a number of concerns I see as important to address
> > in
> > >>> the
> > >>>> compute IR proposal:
> > >>>>
> > >>>> 1. Requirement for output types.
> > >>>>
> > >>>> I think that so far there have been many reasons for requiring
> > conforming IR
> > >>>> producers and consumers to adhere to output types, but I haven't
> seen
> > a
> > >>>> strong rationale for keeping them optional (in the semantic sense,
> not
> > >>> WRT
> > >>>> any particular serialization format's representation of
> optionality).
> > >>>>
> > >>>> I think a design that includes unambiguous semantics for output
> types
> > (a
> > >>>> consumer must produce a value of the requested type or it's an
> > >>>> error/non-conforming) is simpler to reason about for producers, and
> > >>>> provides a strong guarantee for end users (humans or machines
> > >>> constructing
> > >>>> IR from an API and expecting the thing they ask for back from the IR
> > >>>> consumer).
> > >>>>
> > >>>> 2. Flexibility
> > >>>>
> > >>>> The current PR is unable to support what I think are
> killer
> > >>>> features of the IR: custom operators (relational or column) and
> UDFs.
> > In
> > >>> my
> > >>>> mind, on top of the generalized compute description that the IR
> > offers,
> > >>> the
> > >>>> ability for producers and consumers of IR to extend the IR without
> > >>> needing
> > >>>> to modify Arrow or depend on anything except the format is itself
> > >>> something
> > >>>> that is necessary to gain adoption.
> > >>>>
> > >>>> Developers will need to build custom relational operators (e.g.,
> > scans of
> > >>>> backends that don't exist anywhere for which a user has code to
> > >>> implement)
> > >>>> and custom functions (anything operating on a column that doesn't
> > already
> > >>>> exist, really). Furthermore, I think we can actually drive building
> an
> > >>>> Arrow consumer using the same API that an end user would use to
> extend
> > >>> the
> > >>>> IR.
> > >>>>
> > >>>> 3. Window Functions
> > >>>>
> > >>>> Window functions are, I think, an important part of the IR value
> > >>>> proposition, as they are one of the more complex operators in
> > databases.
> > >>> I
> > >>>> think we need to have something in the initial IR proposal to
> support
> > >>> these
> > >>>> operations.
> > >>>>
> > >>>> 4. Non-relational Joins
> > >>>>
> > >>>> Things like as-of join and window join operators aren't yet fleshed
> > out
> > >>> in
> > >>>> the IR, and I'm not sure they should be in scope for the initial
> > >>> prototype.
> > >>>> I think once we settle on a design, we can work the design of these
> > >>>> particular operators out during the initial prototype. I think the
> > >>>> specification of these operators should basically be PR #2 after the
> > >>>> initial design lands.
> > >>>>
> > >>>> # Order of Work
> > >>>>
> > >>>> 1. Nail down the design. Anything else is a non-starter.
> > >>>>
> > >>>> 2. Prototype an IR producer using Ibis
> > >>>>
> > >>>> Ibis is IMO a good candidate for a first IR producer as it has a
> > number
> > >>> of
> > >>>> desirable properties that make prototyping faster and allow for us
> to
> > >>>> refine the design of the IR as needed based on how the
> implementation
> > >>> goes:
> > >>>> * It's written in Python so it has native support for nearly all of
> > >>>> flatbuffers' features without having to create bindings to C++.
> > >>>> * There's already a set of rules for type checking, as well as APIs
> > for
> > >>>> constructing expression trees, which means we don't need to worry
> > about
> > >>>> building a type checker for the prototype.
> > >>>>
> > >>>> 3. Prototype an IR consumer in C++
> > >>>>
> > >>>> I think in parallel to the producer prototype we can further inform
> > the
> > >>>> design from the consumer side by prototyping an IR consumer in C++
> . I
> > >>> know
> > >>>> Ben Kietzman has expressed interest in working on this.
> > >>>>
> > >>>> Very interested to hear others' thoughts.
> > >>>>
> > >>>> -Phillip
> > >>>>
> > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <wesmck...@gmail.com>
> > >>> wrote:
> > >>>>
> > >>>>> Thank you for all the feedback and comments on the document. I'm on
> > >>>>> vacation this week, so I'm delayed responding to everything, but I
> > >>>>> will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > >>>>> next week if anyone would like to chat in person about it, and we
> can
> > >>>>> relay the content of any discussions back to the document/PR/e-mail
> > >>>>> thread.
> > >>>>>
> > >>>>> I know that Phillip Cloud expressed interest in working on the PR
> and
> > >>>>> helping work through many of the details, so I'm glad to have the
> > >>>>> help. If there are others who would like to work on the PR or dig
> > into
> > >>>>> the details, please let me know. We might need to figure out how to
> > >>>>> accommodate "many cooks" by setting up the ComputeIR project
> > somewhere
> > >>>>> separate from the format/ directory to permit it to exist in a
> > >>>>> Work-In-Progress status for a period of time until we work through
> > the
> > >>>>> various details and design concerns.
> > >>>>>
> > >>>>> Re Julian's comment
> > >>>>>
> > >>>>>> The biggest surprise is that this language does full relational
> > >>>>> operations. I was expecting that it would do fragments of the
> > >>> operations.
> > >>>>>
> > >>>>> There's a related but different (yet still interesting and worthy
> of
> > >>>>> analysis) problem of creating an "engine language" that describes
> > more
> > >>>>> mechanically the constituent parts of implementing the relational
> > >>>>> operators. To create a functional computation language with
> concrete
> > >>>>> Arrow data structures as a top-level primitive sounds like an
> > >>>>> interesting research area where I could see something developing
> > >>>>> eventually.
> > >>>>>
> > >>>>> The main problem I'm interested in solving right now is enabling
> > front
> > >>>>> ends that have sufficient understanding of relational algebra and
> > data
> > >>>>> frame operations to talk to engines without having to go backwards
> > >>>>> from their logical query plans to SQL. So as mentioned in the
> > >>>>> document, being able to faithfully carry the relational operator
> node
> > >>>>> information generated by Calcite or Ibis or another system would be
> > >>>>> super useful. Defining the semantics of various kinds of
> user-defined
> > >>>>> functions would also be helpful to standardize the
> > >>>>> engine-to-user-language UDF/extension interface.
> > >>>>>
> > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <alen...@gmail.com>
> > >>>> wrote:
> > >>>>>>
> > >>>>>> Hi Wes,
> > >>>>>>
> > >>>>>> cool initiative! Reminded me of "Building Advanced SQL Analytics
> > From
> > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > >>>> proposes a
> > >>>>>> set of building block for advanced aggregation.
> > >>>>>>
> > >>>>>> Cheers,
> > >>>>>> Dimitri.
> > >>>>>>
> > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> jhyde.apa...@gmail.com>
> > >>>>> wrote:
> > >>>>>>
> > >>>>>>> Wes,
> > >>>>>>>
> > >>>>>>> Thanks for this. I’ve added comments to the doc and to the PR.
> > >>>>>>>
> > >>>>>>> The biggest surprise is that this language does full relational
> > >>>>>>> operations. I was expecting that it would do fragments of the
> > >>>>> operations.
> > >>>>>>> Consider join. A distributed hybrid hash join needs to partition
> > >>> rows
> > >>>>> into
> > >>>>>>> output buffers based on a hash key, build hash tables, probe into
> > >>>> hash
> > >>>>>>> tables, scan hash tables for untouched “outer” rows, and so forth.
> > >>>>>>>
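Those fragments might look like the following toy tasks (a pure-Python
illustration of the decomposition into small batch-level operations, not
Arrow's compute API):

```python
# Toy decomposition of a hash join into separate tasks over batches.
from collections import defaultdict

def partition(batch, key, n):
    # Task 1: route rows to output buffers by hash of the join key.
    out = defaultdict(list)
    for row in batch:
        out[hash(row[key]) % n].append(row)
    return out

def build(rows, key):
    # Task 2: build a hash table for one partition of the build side.
    table = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def probe(table, rows, key):
    # Task 3: probe with the other side; track matched build rows so a
    # later task can emit the untouched "outer" rows.
    matched, joined = set(), []
    for row in rows:
        for hit in table.get(row[key], []):
            matched.add(id(hit))
            joined.append({**hit, **row})
    return joined, matched
```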
> > >>>>>>> I see Arrow’s compute as delivering each of those operations,
> > >>> working
> > >>>>> on
> > >>>>>>> perhaps a batch at a time, or a sequence of batches, with some
> > >>> other
> > >>>>> system
> > >>>>>>> coordinating those tasks. So I would expect to see Arrow’s
> compute
> > >>>>> language
> > >>>>>>> mainly operating on batches rather than a table abstraction.
> > >>>>>>>
> > >>>>>>> Julian
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>> hi folks,
> > >>>>>>>>
> > >>>>>>>> This idea came up in passing in the past -- given that there are
> > >>>>>>>> multiple independent efforts to develop Arrow-native query
> > >>> engines
> > >>>>>>>> (and surely many more to come), it seems like it would be
> > >>> valuable
> > >>>> to
> > >>>>>>>> have a way to enable user languages (like Java, Python, R, or
> > >>> Rust,
> > >>>>>>>> for example) to communicate with backend computing engines (like
> > >>>>>>>> DataFusion, or new computing capabilities being built in the
> > >>> Arrow
> > >>>>> C++
> > >>>>>>>> library) in a fashion that is "lower-level" than SQL and
> > >>>> specialized
> > >>>>>>>> to Arrow's type system. So rather than leaving it to a SQL
> > >>> parser /
> > >>>>>>>> analyzer framework to generate an expression tree of relational
> > >>>>>>>> operators and then translate that to an Arrow-native
> > >>>> compute-engine's
> > >>>>>>>> internal grammar, a user framework could provide the desired
> > >>>>>>>> Arrow-native expression tree / data manipulations directly and
> > >>> skip
> > >>>>>>>> the SQL altogether.
> > >>>>>>>>
> > >>>>>>>> The idea of creating a "serialized intermediate representation
> > >>>> (IR)"
> > >>>>>>>> for Arrow compute operations would be to serve use cases large
> > >>> and
> > >>>>>>>> small -- from the most complex TPC-* or time series database
> > >>> query
> > >>>> to
> > >>>>>>>> the most simple array predicate/filter sent with an RPC request
> > >>>> using
> > >>>>>>>> Arrow Flight. It is deliberately language- and engine-agnostic,
> > >>>> with
> > >>>>>>>> the only commonality across usages being the Arrow columnar
> > >>> format
> > >>>>>>>> (schemas and array types). This would be better than leaving it
> > >>> to
> > >>>>>>>> each application to develop its own bespoke expression
> > >>>>> representations
> > >>>>>>>> for its needs.
> > >>>>>>>>
> > >>>>>>>> I spent a while thinking about this and wrote up a brain dump
> RFC
> > >>>>>>>> document [1] and accompanying pull request [2] that makes the
> > >>>>> possibly
> > >>>>>>>> controversial choice of using Flatbuffers to represent the
> > >>>> serialized
> > >>>>>>>> IR. I discuss the rationale for the choice of Flatbuffers in the
> > >>>> RFC
> > >>>>>>>> document. This PR is obviously deficient in many regards
> > >>>> (incomplete,
> > >>>>>>>> hacky, or unclear in places), and will need some help from
> others
> > >>>> to
> > >>>>>>>> flesh out. I suspect that actually implementing the IR will be
> > >>>>>>>> necessary to work out many of the low-level details.
> > >>>>>>>>
> > >>>>>>>> Note that this IR is intended to be more of a "superset" project
> > >>>> than
> > >>>>>>>> a "lowest common denominator". So there may be things that it
> > >>>>> includes
> > >>>>>>>> which are only available in some engines (e.g. engines that have
> > >>>>>>>> specialized handling of time series data).
> > >>>>>>>>
> > >>>>>>>> As some of my personal motivation for the project, concurrent
> > >>> with
> > >>>>> the
> > >>>>>>>> genesis of Apache Arrow, I started a Python project called Ibis
> > >>> [3]
> > >>>>>>>> (which is similar to R's dplyr project) which serves as a
> "Python
> > >>>>>>>> analytical query IR builder" that is capable of generating most
> > >>> of
> > >>>>> the
> > >>>>>>>> SQL standard, targeting many different SQL dialects and other
> > >>>>> backends
> > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > >>>>> library
> > >>>>>>>> as a "many-SQL" middleware. As such, I would like to be able to
> > >>>>>>>> translate between the in-memory "logical query" data structures
> > >>> in
> > >>>> a
> > >>>>>>>> library like Ibis to a serialized format that can be executed by
> > >>>> many
> > >>>>>>>> different Arrow-native query engines. The expression primitives
> > >>>>>>>> available in Ibis should serve as helpful test cases, too.
> > >>>>>>>>
> > >>>>>>>> I look forward to the community's comments on the RFC document
> > >>> [1]
> > >>>>> and
> > >>>>>>>> pull request [2] -- I realize that arriving at consensus on a
> > >>>> complex
> > >>>>>>>> and ambitious project like this can be challenging so I
> recommend
> > >>>>>>>> spending time on the "non-goals" section in the RFC and ask
> > >>>> questions
> > >>>>>>>> if you are unclear about the scope of what problems this is
> > >>> trying
> > >>>> to
> > >>>>>>>> solve. I would be happy to give Edit access on the RFC document
> > >>> to
> > >>>>>>>> others and would consider ideas about how to move forward with
> > >>>>>>>> something that is able to be implemented by different Arrow
> > >>>> libraries
> > >>>>>>>> in the reasonably near future.
> > >>>>>>>>
> > >>>>>>>> Thanks,
> > >>>>>>>> Wes
> > >>>>>>>>
> > >>>>>>>> [1]:
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > >>>>>>>> [3]: https://ibis-project.org/
> > >>>>>>>> [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > >>>>>>>> [5]:
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>
> > >>>
> >
> >
>
