Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Phillip Cloud Wed, 11 Aug 2021 14:22:09 -0700

On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
[email protected]> wrote:


> Couple of questions
>
> 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the
> operation "(IR,data) - engine -> result" MUST be the same for all "engine"?
>

I think that might be a non-starter for mundane reasons: there's probably
at least two engines
that disagree on the result of sum(x) where x is a floating point column.


> 2. if yes, imo we may need to worry about:
> * a definition of equality that implementations agree on.
> * agreement over what the semantics look like. For example, do we use
> kleene logic for AND and OR?
>

WRT Kleene logic my thoughts are that the IR should support both Kleene and
non-Kleene,
and producers can choose their desired semantics.

Ibis for example, would override the `&` operator in `a & b` to produce
`KleeneAnd(Column(a), Column(b))`.


>
> To try to understand the gist, let's pick an aggregated count over a
> column: engines often do partial counts over partitions followed by a final
> "sum" over the partial counts. Is the idea that the query engine would
> communicate with the compute engine via 2 IRs where one is "count me these"
> the other is "sum me these"?
>
> Best,
> Jorge
>

Not in its current incarnation.

The idea is that the IR producer communicates a desire to count(x) to a
consumer, and  it's up to the consumer to figure out how to turn that count
into something that makes sense for itself. In your example that's a series
of partial counts followed by a sum.


>
>
>
>
>
> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <[email protected]> wrote:
>
> > Thanks Wes,
> >
> > Great to be back working on Arrow again and engaging with the community.
> I
> > am really excited about this effort.
> >
> > I think there are a number of concerns I see as important to address in
> the
> > compute IR proposal:
> >
> > 1. Requirement for output types.
> >
> > I think that so far there's been many reasons for requiring conforming IR
> > producers and consumers to adhere to output types, but I haven't seen a
> > strong rationale for keeping them optional (in the semantic sense, not
> WRT
> > any particular serialization format's representation of optionality).
> >
> > I think a design that includes unambiguous semantics for output types (a
> > consumer must produce a value of the requested type or it's an
> > error/non-conforming) is simpler to reason about for producers, and
> > provides a strong guarantee for end users (humans or machines
> constructing
> > IR from an API and expecting the thing they ask for back from the IR
> > consumer).
> >
> > 2. Flexibility
> >
> > The current PR is currently unable to support what I think are killer
> > features of the IR: custom operators (relational or column) and UDFs. In
> my
> > mind, on top of the generalized compute description that the IR offers,
> the
> > ability for producers and consumers of IR to extend the IR without
> needing
> > to modify Arrow or depend on anything except the format is itself
> something
> > that is necessary to gain adoption.
> >
> > Developers will need to build custom relational operators (e.g., scans of
> > backends that don't exist anywhere for which a user has code to
> implement)
> > and custom functions (anything operating on a column that doesn't already
> > exist, really). Furthermore, I think we can actually drive building an
> > Arrow consumer using the same API that an end user would use to extend
> the
> > IR.
> >
> > 3. Window Functions
> >
> > Window functions are, I think, an important part of the IR value
> > proposition, as they are one of the more complex operators in databases.
> I
> > think we need to have something in the initial IR proposal to support
> these
> > operations.
> >
> > 4. Non relational Joins
> >
> > Things like as-of join and window join operators aren't yet fleshed out
> in
> > the IR, and I'm not sure they should be in scope for the initial
> prototype.
> > I think once we settle on a design, we can work the design of these
> > particular operators out during the initial prototype. I think the
> > specification of these operators should basically be PR #2 after the
> > initial design lands.
> >
> > # Order of Work
> >
> > 1. Nail down the design. Anything else is a non-starter.
> >
> > 2. Prototype an IR producer using Ibis
> >
> > Ibis is IMO a good candidate for a first IR producer as it has a number
> of
> > desirable properties that make prototyping faster and allow for us to
> > refine the design of the IR as needed based on how the implementation
> goes:
> > * It's written in Python so it has native support for nearly all of
> > flatbuffers' features without having to creating bindings to C++.
> > * There's already a set of rules for type checking, as well as APIs for
> > constructing expression trees, which means we don't need to worry about
> > building a type checker for the prototype.
> >
> > 3. Prototype an IR consumer in C++
> >
> > I think in parallel to the producer prototype we can further inform the
> > design from the consumer side by prototyping an IR consumer in C++ . I
> know
> > Ben Kietzman has expressed interest in working on this.
> >
> > Very interested to hear others' thoughts.
> >
> > -Phillip
> >
> > On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <[email protected]>
> wrote:
> >
> > > Thank you for all the feedback and comments on the document. I'm on
> > > vacation this week, so I'm delayed responding to everything, but I
> > > will get to it as quickly as I can. I will be at VLDB in Copenhagen
> > > next week if anyone would like to chat in person about it, and we can
> > > relay the content of any discussions back to the document/PR/e-mail
> > > thread.
> > >
> > > I know that Phillip Cloud expressed interest in working on the PR and
> > > helping work through many of the details, so I'm glad to have the
> > > help. If there are others who would like to work on the PR or dig into
> > > the details, please let me know. We might need to figure out how to
> > > accommodate "many cooks" by setting up the ComputeIR project somewhere
> > > separate from the format/ directory to permit it to exist in a
> > > Work-In-Progress status for a period of time until we work through the
> > > various details and design concerns.
> > >
> > > Re Julian's comment
> > >
> > > > The biggest surprise is that this language does full relational
> > > operations. I was expecting that it would do fragments of the
> operations.
> > >
> > > There's a related but different (yet still interesting and worthy of
> > > analysis) problem of creating an "engine language" that describes more
> > > mechanically the constituent parts of implementing the relational
> > > operators. To create a functional computation language with concrete
> > > Arrow data structures as a top-level primitive sounds like an
> > > interesting research area where I could see something developing
> > > eventually.
> > >
> > > The main problem I'm interested in solving right now is enabling front
> > > ends that have sufficient understanding of relational algebra and data
> > > frame operations to talk to engines without having to go backwards
> > > from their logical query plans to SQL. So as mentioned in the
> > > document, being able to faithfully carry the relational operator node
> > > information generated by Calcite or Ibis or another system would be
> > > super useful. Defining the semantics of various kinds of user-defined
> > > functions would also be helpful to standardize the
> > > engine-to-user-language UDF/extension interface.
> > >
> > > On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <[email protected]>
> > wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > cool initiative! Reminded me of "Building Advanced SQL Analytics From
> > > > Low-Level Plan Operators" from SIGMOD 2021 (
> > > > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which
> > proposes a
> > > > set of building block for advanced aggregation.
> > > >
> > > > Cheers,
> > > > Dimitri.
> > > >
> > > > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <[email protected]>
> > > wrote:
> > > >
> > > > > Wes,
> > > > >
> > > > > Thanks for this. I’ve added comments to the doc and to the PR.
> > > > >
> > > > > The biggest surprise is that this language does full relational
> > > > > operations. I was expecting that it would do fragments of the
> > > operations.
> > > > > Consider join. A distributed hybrid hash join needs to partition
> rows
> > > into
> > > > > output buffers based on a hash key, build hash tables, probe into
> > hash
> > > > > tables, scan hash tables for untouched “outer”rows, and so forth.
> > > > >
> > > > > I see Arrow’s compute as delivering each of those operations,
> working
> > > on
> > > > > perhaps a batch at a time, or a sequence of batches, with some
> other
> > > system
> > > > > coordinating those tasks. So I would expect to see Arrow’s compute
> > > language
> > > > > mainly operating on batches rather than a table abstraction.
> > > > >
> > > > > Julian
> > > > >
> > > > >
> > > > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <[email protected]>
> > > wrote:
> > > > > >
> > > > > > hi folks,
> > > > > >
> > > > > > This idea came up in passing in the past -- given that there are
> > > > > > multiple independent efforts to develop Arrow-native query
> engines
> > > > > > (and surely many more to come), it seems like it would be
> valuable
> > to
> > > > > > have a way to enable user languages (like Java, Python, R, or
> Rust,
> > > > > > for example) to communicate with backend computing engines (like
> > > > > > DataFusion, or new computing capabilities being built in the
> Arrow
> > > C++
> > > > > > library) in a fashion that is "lower-level" than SQL and
> > specialized
> > > > > > to Arrow's type system. So rather than leaving it to a SQL
> parser /
> > > > > > analyzer framework to generate an expression tree of relational
> > > > > > operators and then translate that to an Arrow-native
> > compute-engine's
> > > > > > internal grammer, a user framework could provide the desired
> > > > > > Arrow-native expression tree / data manipulations directly and
> skip
> > > > > > the SQL altogether.
> > > > > >
> > > > > > The idea of creating a "serialized intermediate representation
> > (IR)"
> > > > > > for Arrow compute operations would be to serve use cases large
> and
> > > > > > small -- from the most complex TPC-* or time series database
> query
> > to
> > > > > > the most simple array predicate/filter sent with an RPC request
> > using
> > > > > > Arrow Flight. It is deliberately language- and engine-agnostic,
> > with
> > > > > > the only commonality across usages being the Arrow columnar
> format
> > > > > > (schemas and array types). This would be better than leaving it
> to
> > > > > > each application to develop its own bespoke expression
> > > representations
> > > > > > for its needs.
> > > > > >
> > > > > > I spent a while thinking about this and wrote up a brain dump RFC
> > > > > > document [1] and accompanying pull request [2] that makes the
> > > possibly
> > > > > > controversial choice of using Flatbuffers to represent the
> > serialized
> > > > > > IR. I discuss the rationale for the choice of Flatbuffers in the
> > RFC
> > > > > > document. This PR is obviously deficient in many regards
> > (incomplete,
> > > > > > hacky, or unclear in places), and will need some help from others
> > to
> > > > > > flesh out. I suspect that actually implementing the IR will be
> > > > > > necessary to work out many of the low-level details.
> > > > > >
> > > > > > Note that this IR is intended to be more of a "superset" project
> > than
> > > > > > a "lowest common denominator". So there may be things that it
> > > includes
> > > > > > which are only available in some engines (e.g. engines that have
> > > > > > specialized handling of time series data).
> > > > > >
> > > > > > As some of my personal motivation for the project, concurrent
> with
> > > the
> > > > > > genesis of Apache Arrow, I started a Python project called Ibis
> [3]
> > > > > > (which is similar to R's dplyr project) which serves as a "Python
> > > > > > analytical query IR builder" that is capable of generating most
> of
> > > the
> > > > > > SQL standard, targeting many different SQL dialects and other
> > > backends
> > > > > > (like pandas). Microsoft ([4]) and Google ([5]) have used this
> > > library
> > > > > > as a "many-SQL" middleware. As such, I would like to be able to
> > > > > > translate between the in-memory "logical query" data structures
> in
> > a
> > > > > > library like Ibis to a serialized format that can be executed by
> > many
> > > > > > different Arrow-native query engines. The expression primitives
> > > > > > available in Ibis should serve as helpful test cases, too.
> > > > > >
> > > > > > I look forward to the community's comments on the RFC document
> [1]
> > > and
> > > > > > pull request [2] -- I realize that arriving at consensus on a
> > complex
> > > > > > and ambitious project like this can be challenging so I recommend
> > > > > > spending time on the "non-goals" section in the RFC and ask
> > questions
> > > > > > if you are unclear about the scope of what problems this is
> trying
> > to
> > > > > > solve. I would be happy to give Edit access on the RFC document
> to
> > > > > > others and would consider ideas about how to move forward with
> > > > > > something that is able to be implemented by different Arrow
> > libraries
> > > > > > in the reasonably near future.
> > > > > >
> > > > > > Thanks,
> > > > > > Wes
> > > > > >
> > > > > > [1]:
> > > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > [2]: https://github.com/apache/arrow/pull/10856
> > > > > > [3]: https://ibis-project.org/
> > > > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > [5]:
> > > > >
> > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > >
> > > > >
> > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Reply via email to