Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Phillip Cloud Mon, 30 Aug 2021 14:10:28 -0700

Hey everyone,

There's some interesting discussion around types and where their location
is in the current PR [1] (and in fact whether to store them at all).


It would be great to get some community feedback on this [2] part of the PR
in particular, because the choice of whether to store types at all has
important design consequences.

[1]: https://github.com/apache/arrow/pull/10934
[2]: https://github.com/apache/arrow/pull/10934/files#r697025313

On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <[email protected]>
wrote:

> As an FYI, Iceberg is also considering an IR in relation to view support
> [1].  I chimed in and pointed them to this thread and Wes's doc.  Phillip
> and Jacques chimed in there as well.
>
> [1]
>
> https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E
>
> On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <[email protected]> wrote:
>
> > Thanks for the feedback Jacques, very helpful. In the latest version of
> the
> > PR, I've tried to incorporate nearly all of these points.
> >
> > - I've incorporated most of what you had for dereferencing operations
> into
> > the PR, and gotten rid of schemas except on Read/Write relations.
> > - With respect to properties, I've made a bunch more specific operators,
> > and kept user-defined things special case-y.
> > - I haven't incorporated anything close to physical-plan things, but I
> > think that's a good follow up PR. Having separate representations for
> > logical/physical plans seems like a waste of effort ultimately. I think
> we
> > can find a good balance.
> > - Agree on UDF support, I think that will have to evolve as the rest of
> the
> > spec evolves. UDFs will need language-dedicated effort given the large
> > variety of languages that folks will want to use to define functions.
> >
> > On a separate note, in an effort to move this project forward I'd like to
> > do one final round of code review from anyone interested and then merge
> the
> > PR after that.
> > This spec will be unstable for a while, until we can work out all the
> > design kinks and edge cases, and I think getting this initial spec in is
> > important to start that process.
> >
> >
> > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <[email protected]>
> wrote:
> >
> > > In a lucky turn of events, Phillip actually turned out to be in my neck
> > of
> > > the woods on Friday so we had a chance to sit down and discuss this. To
> > > help, I actually shared something I had been working on a few months
> ago
> > > independently (before this discussion started).
> > >
> > > For reference:
> > > Wes PR: https://github.com/apache/arrow/pull/10856
> > > Ben PR: https://github.com/apache/arrow/pull/10934
> > > Jacques PR: https://github.com/apache/arrow/pull/10979
> > >
> > > The high level points of feedback I have are:
> > >
> > >    - Ben PR feels too deconstructed. While I like the elegance and
> > >    symmetry, I believe this will lead to substantially more work in
> > moving
> > >    from serialization format to something closer to what a system would
> > > want
> > >    to manipulate/consume. The reality is that there are a lot of really
> > > known
> > >    things and specializing the representation for these things will
> > > ultimately
> > >    make things easier to program with without error and easier to
> debug.
> > > (For
> > >    example, imagine trying to inspect a plan in a debugging session
> with
> > > the
> > >    Ben representation.) We should embrace the known things in the
> > >    specification.
> > >    - I believe that it is a mistake for the inner workings of the plan
> to
> > >    ever use field names. Only input (e.g. file read) and Output (e.g.
> > > return
> > >    to user or write to file) need to have field names. For the rest of
> > the
> > >    system, using field ordinals (determinant whether nested or flat) is
> > > much
> > >    cleaner and is how most execution systems work. For example, in
> > Impala I
> > >    believe it is called a slot. As I noted in the PR, Calcite as an
> > > example is
> > >    entirely ordinal based at the algebra level. Rowtypes contain field
> > > names
> > >    but they are actually basically pointless. Field references use
> > > RexInputRef
> > >    with ordinal based and rules around column order output (e.g. what
> is
> > > the
> > >    field order of a join output) are determinant and done entirely at
> an
> > >    ordinal level. The only place where names should be used (besides
> > >    input/output) is in the case of map keys. In that case, the names
> are
> > >    actually data, as opposed to scheme metadata. This is why I propose
> a
> > >    strongly structured dereference operation [1].
> > >    - Properties should only be included in the serialization if they
> are
> > >    not easily re-derivable at plan consumption time. For example,
> you'll
> > > note
> > >    that I don't store schema information for relational operation. Each
> > >    function and relational operation should already know how a given
> > input
> > > is
> > >    transformed to a given output. Capturing this information in the
> > > plan/IR is
> > >    excessive. In many ways, I compare it to the early use of
> VectorLayout
> > > [2]
> > >    in Arrow schema. It may have provided some additional checksum of
> the
> > >    message but ultimately it caused more pain than it was worth (and
> thus
> > > we
> > >    removed it before formalizing the specification). For reference, in
> > the
> > >    context of Dremio, we used to actually do this, send schema
> > information
> > >    around for all operations. We removed it because in many cases
> > becoming
> > > the
> > >    majority of our internal plan serialization (imagine simple
> operations
> > > that
> > >    are carrying 1000s of fields).
> > >    - I suggest focusing on support for both logical and physical
> > >    representations. The moment you start talking about optimization
> > passes,
> > >    many of those would probably be better being done at the logical
> > level.
> > > The
> > >    overlap is really high.
> > >    - I think a lot more work must be done before introducing UDFs and
> > user
> > >    defined relational operations. I see one goal being the possibility
> of
> > >    there being three systems: A -> B -> C. A is a IR producer. C is a
> IR
> > >    consumer and B is a IR filter or translator. In this situation, B
> > > should be
> > >    able to operate and do optimizations on a plan even if if there are
> > > black
> > >    box user defined operations. Being able to know the
> > > properties-preservation
> > >    or not of these operations is important to making decisions. For
> > > example,
> > >    does a user defined relational operation maintain sortedness?
> > > Distribution?
> > >    Is a defined UDF idempotent? As such, I think the definition of
> those
> > > black
> > >    boxes should be much more structured. For example: it is a python
> > >    relational operation named X stored in Y that maintains properties
> 1,2
> > > and
> > >    disrupts property 3. Putting just a black box of bytes will
> > > substantially
> > >    reduce the compatibility and extensibility of the ecosystem of tools
> > >    working against IR. I'd note that I wouldn't expect this to be a
> > burden
> > > to
> > >    actual end users. By using sensible defaults, I still would expect
> an
> > > end
> > >    user tool to support arbitrary user defined operations.
> > >    - It might make sense to review the XML representation that Orca
> uses
> > >    [3]. I haven't looked at it recently but they had a strong goal of
> > >    decoupling for most of its life (to use in both Greenplum and Hawq).
> > It
> > >    could be the most mature/formal serialization of query plans
> > publically
> > >    available.
> > >
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34
> > > [2]
> > >
> > >
> >
> https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0
> > > [3] https://github.com/greenplum-db/gporca
> > >
> > >
> > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <[email protected]>
> > wrote:
> > >
> > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <[email protected]>
> > > wrote:
> > > >
> > > > > Looking at Ben's alternate PR [1], having an IR that leans heavily
> on
> > > > > memory references to an out-of-band data sidecar seems like an
> > > > > approach that would substantially ratchet up the implementation
> > > > > complexity as producing the IR would then have the level of
> > complexity
> > > > > of producing the Arrow IPC format — when producing the "root" Plan
> > > > > message, you must accumulate a list of the dependent serialized
> > > > > submessages, placing the appropriate Buffer memory offset in the
> Plan
> > > > > message, like we do when producing the RecordBatch.buffers field.
> > This
> > > > > seems complicated to me as you must devise a custom binary protocol
> > to
> > > > > concatenate the serialized Plan and the messages it depends on
> into a
> > > > > single binary payload
> > > > >
> > > > > <ROOT PLAN>
> > > > > <padding>
> > > > > <Buffer 0>
> > > > > <padding>
> > > > > <Buffer 1>
> > > > > <padding>
> > > > > ...
> > > > > <Buffer N - 1>
> > > > >
> > > > > (one purpose of FlatBufferBuilder is to spare you having to do this
> > > > > yourself — some reasons we do it for the Arrow IPC format is
> because
> > > > > appending Arrow memory buffers directly to a FlatBufferBuilder
> would
> > > > > be inefficient — internal realloc calls — and Flatbuffers are
> limited
> > > > > to 2GB. Neither of these things are problems here)
> > > > >
> > > > > In general, I believe the output created by an IR producer should
> be
> > a
> > > > > single serialized object without any out-of-band data sidecar —
> this
> > > > > is much simpler for implementers and we can still provide an
> "escape
> > > > > hatch" for user-defined operators and functions where the required
> > > > > function/operator is passed opaquely as an embedded binary data.
> > > >
> > > >
> > > >
> > > > The serialization format (whether it is Flatbuffers or JSON, or
> > > > > something else) should allow for data memoization, so if there is a
> > > > > user-defined operator/function, or a relation that is used multiple
> > > > > times throughout the query (potentially with a large schema), then
> we
> > > > > should ensure that the data need not be duplicated in the
> > > > > serialization format unnecessarily — in Flatbuffers, you can
> achieve
> > > > > this by reusing offsets, but we could devise the data structures to
> > > > > make the memoization of "expensive" objects more explicit.
> > > > >
> > > >
> > > > I think this is something that would need to be explicitly encoded in
> > > > the structures themselves if it's a hard requirement. I don't think
> > this
> > > > should block
> > > > a prototype producer/consumer.
> > > >
> > > > Is there something in the second PR/design that precludes the reuse
> of
> > > > offsets?
> > > > To my eye, the flatbuffers offset reuse mechanism works just as well
> > > there.
> > > >
> > > >
> > > > > I additionally think that it is important to provide as much
> built-in
> > > > > support for "canonical" operators/functions (such as the ones
> > > > > implemented commonly by SQL engines) as possible, and to liberally
> > > > > expand the catalogue of "built-in" capabilities. I would still
> prefer
> > > > > to have large unions/enums of built-in operators/functions and to
> > > > > expand those unions/enums to accommodate new things when it is
> > > > > demonstrated that there is a need to standardize things between
> > > > > producers/consumers.
> > > > >
> > > >
> > > > I think there's a middle ground where we add a bit of structure
> > > (something
> > > > like
> > > > a descriptor from the first PR) to indicate whether a thing is
> built-in
> > > vs
> > > > user-defined.
> > > > It looks like Ben has pushed something like this to his PR.
> > > >
> > > > With that scheme, we have both flexibility and a small set of special
> > > > builtins that make up
> > > > a statically typed set for expressions and relational operators.
> > > >
> > > > I would really like to vet this PR with a prototype this week,
> > > > to see whether we need to revisit any major choices. I don't think
> > we'll
> > > be
> > > > able to
> > > > anticipate all the consequences until we write some code.
> > > >
> > > >
> > > > >
> > > > > One of the beneficial properties of the Union/Enum approach for the
> > > > > operator/function catalogues, is that when there are additions to
> > > > > those enums, the generated Flatbuffers files will cause many
> language
> > > > > compilers to warn or error on unhandled enum cases. If all
> > > > > function/operator names are strings, then you are essentially
> > > > > reimplementing the functionality provided by enums by hand. I
> > > > > initially used strings for all function references in my original
> > > > > prototype, but I now think that using an enum for "built-ins" would
> > be
> > > > > superior (because of the code-generated enum interface) and not a
> > > > > premature optimization.
> > > > >
> > > > > [1]: https://github.com/apache/arrow/pull/10934
> > > > >
> > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > Hey all,
> > > > > >
> > > > > > Just wanted to give an update on the effort here.
> > > > > >
> > > > > > Ben Kietzman has created an alternative proposal to the initial
> > > design
> > > > > [1].
> > > > > > It largely overlaps with the original, but differs in a few
> > important
> > > > > ways:
> > > > > >
> > > > > > * A big focus of the design is on flexibility, allowing
> producers,
> > > > > > consumers and ultimately end users of those systems the ability
> to
> > > > define
> > > > > > custom operations in the graph.
> > > > > > * There are very few predefined relational operations (project,
> > > filter,
> > > > > > join and a handful of others).
> > > > > > * There are only 3 types of value expressions: literals, field
> > > > > references,
> > > > > > and function calls.
> > > > > > * The model of evaluation is one that requires a final sink node,
> > to
> > > > > > indicate where the record batches from child relations end up (a
> > > file,
> > > > a
> > > > > > socket, an in-memory buffer, etc).
> > > > > >
> > > > > > I've added notes [2] to the original Google doc (under the
> > > Alternative
> > > > > > Design Notes subheading), and a few pseudocode examples.
> > > > > > Unfortunately, these went out of date as soon as Ben pushed the
> PR
> > > [3],
> > > > > so
> > > > > > I need to update those to reflect his changes. Regardless,
> > > > > > the design is broadly the same, so it should still give a good
> > > > indication
> > > > > > of the details of the design.
> > > > > >
> > > > > > There are a decent number of review comments on the original PR
> > that
> > > I
> > > > > plan
> > > > > > to port over where they are still relevant.
> > > > > > I also plan on adding support for window functions either tonight
> > or
> > > on
> > > > > > Monday.
> > > > > >
> > > > > > Please review this design at your earliest convenience. Since
> > > there's a
> > > > > > fairly concrete set of types in flatbuffers that
> > > > > > we can look at, ideally we can center discussion around the
> details
> > > in
> > > > > the
> > > > > > PR.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > [1]: https://github.com/apache/arrow/pull/10856
> > > > > > [2]:
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13
> > > > > > [3]: https://github.com/apache/arrow/pull/10934
> > > > > >
> > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde <
> > [email protected]>
> > > > > wrote:
> > > > > >
> > > > > > > > Wes wrote:
> > > > > > > >
> > > > > > > > Supporting this kind of intra-application engine
> > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > The data format is the natural interface between tasks.
> (Defining
> > > > > “task”
> > > > > > > here as “something that is programmed using the IR”.) That is
> > > Arrow’s
> > > > > > > strength.
> > > > > > >
> > > > > > > So I think the IR should describe what each task should do, and
> > > tasks
> > > > > > > should be fairly small. Not whole relational operators,
> operating
> > > on
> > > > > whole
> > > > > > > tables, but pieces of relational operators, operating on
> batches
> > or
> > > > > > > sequences of batches.
> > > > > > >
> > > > > > > Elsethread, someone mentioned the LoLePop concept and the
> > > > > Kohn/Leis/Neuman
> > > > > > > paper [1]. The LoLePop concept sounds good for our purposes.
> > > > > > >
> > > > > > > Julian
> > > > > > >
> > > > > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf
> > > > > > >
> > > > > > >
> > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney <
> [email protected]
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > > >>
> > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão <
> > > > > > > >> [email protected]> wrote:
> > > > > > > >>
> > > > > > > >>> Couple of questions
> > > > > > > >>>
> > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given
> > > > > (IR,data), the
> > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the same
> for
> > > all
> > > > > > > "engine"?
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> I think that might be a non-starter for mundane reasons:
> > there's
> > > > > > > probably
> > > > > > > >> at least two engines
> > > > > > > >> that disagree on the result of sum(x) where x is a floating
> > > point
> > > > > > > column.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>> 2. if yes, imo we may need to worry about:
> > > > > > > >>> * a definition of equality that implementations agree on.
> > > > > > > >>> * agreement over what the semantics look like. For example,
> > do
> > > we
> > > > > use
> > > > > > > >>> kleene logic for AND and OR?
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> WRT Kleene logic my thoughts are that the IR should support
> > both
> > > > > Kleene
> > > > > > > and
> > > > > > > >> non-Kleene,
> > > > > > > >> and producers can choose their desired semantics.
> > > > > > > >>
> > > > > > > >> Ibis for example, would override the `&` operator in `a & b`
> > to
> > > > > produce
> > > > > > > >> `KleeneAnd(Column(a), Column(b))`.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>>
> > > > > > > >>> To try to understand the gist, let's pick an aggregated
> count
> > > > over
> > > > > a
> > > > > > > >>> column: engines often do partial counts over partitions
> > > followed
> > > > > by a
> > > > > > > final
> > > > > > > >>> "sum" over the partial counts. Is the idea that the query
> > > engine
> > > > > would
> > > > > > > >>> communicate with the compute engine via 2 IRs where one is
> > > "count
> > > > > me
> > > > > > > these"
> > > > > > > >>> the other is "sum me these"?
> > > > > > > >>>
> > > > > > > >>> Best,
> > > > > > > >>> Jorge
> > > > > > > >>>
> > > > > > > >>
> > > > > > > >> Not in its current incarnation.
> > > > > > > >>
> > > > > > > >> The idea is that the IR producer communicates a desire to
> > > count(x)
> > > > > to a
> > > > > > > >> consumer, and  it's up to the consumer to figure out how to
> > turn
> > > > > that
> > > > > > > count
> > > > > > > >> into something that makes sense for itself. In your example
> > > > that's a
> > > > > > > series
> > > > > > > >> of partial counts followed by a sum.
> > > > > > > >>
> > > > > > > >
> > > > > > > > That said, I think there is a valid use case here where a
> > system
> > > > > might
> > > > > > > > make use of different engines to execute different
> (composable)
> > > > > layers
> > > > > > > > of a complex query.
> > > > > > > >
> > > > > > > > For example:
> > > > > > > >
> > > > > > > > * suppose you want to scan and do predicate pushdown on an
> > > unusual
> > > > > > > > data source that is only accessible from one particular
> engine
> > > but
> > > > > > > > * you need to do some analytical operation with the scan
> > results
> > > > that
> > > > > > > > is only supported by another engine
> > > > > > > >
> > > > > > > > You could decompose the query into two stages with an IR
> > > relational
> > > > > > > > expression for each stage and use then the engines together
> to
> > > > > > > > accomplish what you need (of course, you would need an
> > > > orchestration
> > > > > > > > layer to handle plumbing the query engine inputs and outputs
> > > > together
> > > > > > > > as Arrow streams). Supporting this kind of intra-application
> > > engine
> > > > > > > > heterogeneity is one of the motivations for the project.
> > > > > > > >
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <
> > > [email protected]
> > > > >
> > > > > > > wrote:
> > > > > > > >>>
> > > > > > > >>>> Thanks Wes,
> > > > > > > >>>>
> > > > > > > >>>> Great to be back working on Arrow again and engaging with
> > the
> > > > > > > community.
> > > > > > > >>> I
> > > > > > > >>>> am really excited about this effort.
> > > > > > > >>>>
> > > > > > > >>>> I think there are a number of concerns I see as important
> to
> > > > > address
> > > > > > > in
> > > > > > > >>> the
> > > > > > > >>>> compute IR proposal:
> > > > > > > >>>>
> > > > > > > >>>> 1. Requirement for output types.
> > > > > > > >>>>
> > > > > > > >>>> I think that so far there's been many reasons for
> requiring
> > > > > > > conforming IR
> > > > > > > >>>> producers and consumers to adhere to output types, but I
> > > haven't
> > > > > seen
> > > > > > > a
> > > > > > > >>>> strong rationale for keeping them optional (in the
> semantic
> > > > > sense, not
> > > > > > > >>> WRT
> > > > > > > >>>> any particular serialization format's representation of
> > > > > optionality).
> > > > > > > >>>>
> > > > > > > >>>> I think a design that includes unambiguous semantics for
> > > output
> > > > > types
> > > > > > > (a
> > > > > > > >>>> consumer must produce a value of the requested type or
> it's
> > an
> > > > > > > >>>> error/non-conforming) is simpler to reason about for
> > > producers,
> > > > > and
> > > > > > > >>>> provides a strong guarantee for end users (humans or
> > machines
> > > > > > > >>> constructing
> > > > > > > >>>> IR from an API and expecting the thing they ask for back
> > from
> > > > the
> > > > > IR
> > > > > > > >>>> consumer).
> > > > > > > >>>>
> > > > > > > >>>> 2. Flexibility
> > > > > > > >>>>
> > > > > > > >>>> The current PR is currently unable to support what I think
> > are
> > > > > killer
> > > > > > > >>>> features of the IR: custom operators (relational or
> column)
> > > and
> > > > > UDFs.
> > > > > > > In
> > > > > > > >>> my
> > > > > > > >>>> mind, on top of the generalized compute description that
> the
> > > IR
> > > > > > > offers,
> > > > > > > >>> the
> > > > > > > >>>> ability for producers and consumers of IR to extend the IR
> > > > without
> > > > > > > >>> needing
> > > > > > > >>>> to modify Arrow or depend on anything except the format is
> > > > itself
> > > > > > > >>> something
> > > > > > > >>>> that is necessary to gain adoption.
> > > > > > > >>>>
> > > > > > > >>>> Developers will need to build custom relational operators
> > > (e.g.,
> > > > > > > scans of
> > > > > > > >>>> backends that don't exist anywhere for which a user has
> code
> > > to
> > > > > > > >>> implement)
> > > > > > > >>>> and custom functions (anything operating on a column that
> > > > doesn't
> > > > > > > already
> > > > > > > >>>> exist, really). Furthermore, I think we can actually drive
> > > > > building an
> > > > > > > >>>> Arrow consumer using the same API that an end user would
> use
> > > to
> > > > > extend
> > > > > > > >>> the
> > > > > > > >>>> IR.
> > > > > > > >>>>
> > > > > > > >>>> 3. Window Functions
> > > > > > > >>>>
> > > > > > > >>>> Window functions are, I think, an important part of the IR
> > > value
> > > > > > > >>>> proposition, as they are one of the more complex operators
> > in
> > > > > > > databases.
> > > > > > > >>> I
> > > > > > > >>>> think we need to have something in the initial IR proposal
> > to
> > > > > support
> > > > > > > >>> these
> > > > > > > >>>> operations.
> > > > > > > >>>>
> > > > > > > >>>> 4. Non relational Joins
> > > > > > > >>>>
> > > > > > > >>>> Things like as-of join and window join operators aren't
> yet
> > > > > fleshed
> > > > > > > out
> > > > > > > >>> in
> > > > > > > >>>> the IR, and I'm not sure they should be in scope for the
> > > initial
> > > > > > > >>> prototype.
> > > > > > > >>>> I think once we settle on a design, we can work the design
> > of
> > > > > these
> > > > > > > >>>> particular operators out during the initial prototype. I
> > think
> > > > the
> > > > > > > >>>> specification of these operators should basically be PR #2
> > > after
> > > > > the
> > > > > > > >>>> initial design lands.
> > > > > > > >>>>
> > > > > > > >>>> # Order of Work
> > > > > > > >>>>
> > > > > > > >>>> 1. Nail down the design. Anything else is a non-starter.
> > > > > > > >>>>
> > > > > > > >>>> 2. Prototype an IR producer using Ibis
> > > > > > > >>>>
> > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer as it
> > > has a
> > > > > > > number
> > > > > > > >>> of
> > > > > > > >>>> desirable properties that make prototyping faster and
> allow
> > > for
> > > > > us to
> > > > > > > >>>> refine the design of the IR as needed based on how the
> > > > > implementation
> > > > > > > >>> goes:
> > > > > > > >>>> * It's written in Python so it has native support for
> nearly
> > > all
> > > > > of
> > > > > > > >>>> flatbuffers' features without having to creating bindings
> to
> > > > C++.
> > > > > > > >>>> * There's already a set of rules for type checking, as
> well
> > as
> > > > > APIs
> > > > > > > for
> > > > > > > >>>> constructing expression trees, which means we don't need
> to
> > > > worry
> > > > > > > about
> > > > > > > >>>> building a type checker for the prototype.
> > > > > > > >>>>
> > > > > > > >>>> 3. Prototype an IR consumer in C++
> > > > > > > >>>>
> > > > > > > >>>> I think in parallel to the producer prototype we can
> further
> > > > > inform
> > > > > > > the
> > > > > > > >>>> design from the consumer side by prototyping an IR
> consumer
> > in
> > > > > C++ . I
> > > > > > > >>> know
> > > > > > > >>>> Ben Kietzman has expressed interest in working on this.
> > > > > > > >>>>
> > > > > > > >>>> Very interested to hear others' thoughts.
> > > > > > > >>>>
> > > > > > > >>>> -Phillip
> > > > > > > >>>>
> > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <
> > > > > [email protected]>
> > > > > > > >>> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Thank you for all the feedback and comments on the
> > document.
> > > > I'm
> > > > > on
> > > > > > > >>>>> vacation this week, so I'm delayed responding to
> > everything,
> > > > but
> > > > > I
> > > > > > > >>>>> will get to it as quickly as I can. I will be at VLDB in
> > > > > Copenhagen
> > > > > > > >>>>> next week if anyone would like to chat in person about
> it,
> > > and
> > > > > we can
> > > > > > > >>>>> relay the content of any discussions back to the
> > > > > document/PR/e-mail
> > > > > > > >>>>> thread.
> > > > > > > >>>>>
> > > > > > > >>>>> I know that Phillip Cloud expressed interest in working
> on
> > > the
> > > > > PR and
> > > > > > > >>>>> helping work through many of the details, so I'm glad to
> > have
> > > > the
> > > > > > > >>>>> help. If there are others who would like to work on the
> PR
> > or
> > > > dig
> > > > > > > into
> > > > > > > >>>>> the details, please let me know. We might need to figure
> > out
> > > > how
> > > > > to
> > > > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR
> > project
> > > > > > > somewhere
> > > > > > > >>>>> separate from the format/ directory to permit it to exist
> > in
> > > a
> > > > > > > >>>>> Work-In-Progress status for a period of time until we
> work
> > > > > through
> > > > > > > the
> > > > > > > >>>>> various details and design concerns.
> > > > > > > >>>>>
> > > > > > > >>>>> Re Julian's comment
> > > > > > > >>>>>
> > > > > > > >>>>>> The biggest surprise is that this language does full
> > > > relational
> > > > > > > >>>>> operations. I was expecting that it would do fragments of
> > the
> > > > > > > >>> operations.
> > > > > > > >>>>>
> > > > > > > >>>>> There's a related but different (yet still interesting
> and
> > > > > worthy of
> > > > > > > >>>>> analysis) problem of creating an "engine language" that
> > > > describes
> > > > > > > more
> > > > > > > >>>>> mechanically the constituent parts of implementing the
> > > > relational
> > > > > > > >>>>> operators. To create a functional computation language
> with
> > > > > concrete
> > > > > > > >>>>> Arrow data structures as a top-level primitive sounds
> like
> > an
> > > > > > > >>>>> interesting research area where I could see something
> > > > developing
> > > > > > > >>>>> eventually.
> > > > > > > >>>>>
> > > > > > > >>>>> The main problem I'm interested in solving right now is
> > > > enabling
> > > > > > > front
> > > > > > > >>>>> ends that have sufficient understanding of relational
> > algebra
> > > > and
> > > > > > > data
> > > > > > > >>>>> frame operations to talk to engines without having to go
> > > > > backwards
> > > > > > > >>>>> from their logical query plans to SQL. So as mentioned in
> > the
> > > > > > > >>>>> document, being able to faithfully carry the relational
> > > > operator
> > > > > node
> > > > > > > >>>>> information generated by Calcite or Ibis or another
> system
> > > > would
> > > > > be
> > > > > > > >>>>> super useful. Defining the semantics of various kinds of
> > > > > user-defined
> > > > > > > >>>>> functions would also be helpful to standardize the
> > > > > > > >>>>> engine-to-user-language UDF/extension interface.
> > > > > > > >>>>>
> > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <
> > > > > [email protected]>
> > > > > > > >>>> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>> Hi Wes,
> > > > > > > >>>>>>
> > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL
> > > > Analytics
> > > > > > > From
> > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 (
> > > > > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf)
> > > which
> > > > > > > >>>> proposes a
> > > > > > > >>>>>> set of building block for advanced aggregation.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Cheers,
> > > > > > > >>>>>> Dimitri.
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <
> > > > > [email protected]>
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>>
> > > > > > > >>>>>>> Wes,
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to
> > the
> > > > PR.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> The biggest surprise is that this language does full
> > > > relational
> > > > > > > >>>>>>> operations. I was expecting that it would do fragments
> of
> > > the
> > > > > > > >>>>> operations.
> > > > > > > >>>>>>> Consider join. A distributed hybrid hash join needs to
> > > > > partition
> > > > > > > >>> rows
> > > > > > > >>>>> into
> > > > > > > >>>>>>> output buffers based on a hash key, build hash tables,
> > > probe
> > > > > into
> > > > > > > >>>> hash
> > > > > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and
> > so
> > > > > forth.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those
> > > operations,
> > > > > > > >>> working
> > > > > > > >>>>> on
> > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of batches,
> with
> > > > some
> > > > > > > >>> other
> > > > > > > >>>>> system
> > > > > > > >>>>>>> coordinating those tasks. So I would expect to see
> > Arrow’s
> > > > > compute
> > > > > > > >>>>> language
> > > > > > > >>>>>>> mainly operating on batches rather than a table
> > > abstraction.
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Julian
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney <
> > > > [email protected]
> > > > > >
> > > > > > > >>>>> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> hi folks,
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> This idea came up in passing in the past -- given that
> > > there
> > > > > are
> > > > > > > >>>>>>>> multiple independent efforts to develop Arrow-native
> > query
> > > > > > > >>> engines
> > > > > > > >>>>>>>> (and surely many more to come), it seems like it would
> > be
> > > > > > > >>> valuable
> > > > > > > >>>> to
> > > > > > > >>>>>>>> have a way to enable user languages (like Java,
> Python,
> > R,
> > > > or
> > > > > > > >>> Rust,
> > > > > > > >>>>>>>> for example) to communicate with backend computing
> > engines
> > > > > (like
> > > > > > > >>>>>>>> DataFusion, or new computing capabilities being built
> in
> > > the
> > > > > > > >>> Arrow
> > > > > > > >>>>> C++
> > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL
> and
> > > > > > > >>>> specialized
> > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a
> > SQL
> > > > > > > >>> parser /
> > > > > > > >>>>>>>> analyzer framework to generate an expression tree of
> > > > > relational
> > > > > > > >>>>>>>> operators and then translate that to an Arrow-native
> > > > > > > >>>> compute-engine's
> > > > > > > >>>>>>>> internal grammer, a user framework could provide the
> > > desired
> > > > > > > >>>>>>>> Arrow-native expression tree / data manipulations
> > directly
> > > > and
> > > > > > > >>> skip
> > > > > > > >>>>>>>> the SQL altogether.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> The idea of creating a "serialized intermediate
> > > > representation
> > > > > > > >>>> (IR)"
> > > > > > > >>>>>>>> for Arrow compute operations would be to serve use
> cases
> > > > large
> > > > > > > >>> and
> > > > > > > >>>>>>>> small -- from the most complex TPC-* or time series
> > > database
> > > > > > > >>> query
> > > > > > > >>>> to
> > > > > > > >>>>>>>> the most simple array predicate/filter sent with an
> RPC
> > > > > request
> > > > > > > >>>> using
> > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and
> > > > > engine-agnostic,
> > > > > > > >>>> with
> > > > > > > >>>>>>>> the only commonality across usages being the Arrow
> > > columnar
> > > > > > > >>> format
> > > > > > > >>>>>>>> (schemas and array types). This would be better than
> > > leaving
> > > > > it
> > > > > > > >>> to
> > > > > > > >>>>>>>> each application to develop its own bespoke expression
> > > > > > > >>>>> representations
> > > > > > > >>>>>>>> for its needs.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I spent a while thinking about this and wrote up a
> brain
> > > > dump
> > > > > RFC
> > > > > > > >>>>>>>> document [1] and accompanying pull request [2] that
> > makes
> > > > the
> > > > > > > >>>>> possibly
> > > > > > > >>>>>>>> controversial choice of using Flatbuffers to represent
> > the
> > > > > > > >>>> serialized
> > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of
> > Flatbuffers
> > > in
> > > > > the
> > > > > > > >>>> RFC
> > > > > > > >>>>>>>> document. This PR is obviously deficient in many
> regards
> > > > > > > >>>> (incomplete,
> > > > > > > >>>>>>>> hacky, or unclear in places), and will need some help
> > from
> > > > > others
> > > > > > > >>>> to
> > > > > > > >>>>>>>> flesh out. I suspect that actually implementing the IR
> > > will
> > > > be
> > > > > > > >>>>>>>> necessary to work out many of the low-level details.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Note that this IR is intended to be more of a
> "superset"
> > > > > project
> > > > > > > >>>> than
> > > > > > > >>>>>>>> a "lowest common denominator". So there may be things
> > that
> > > > it
> > > > > > > >>>>> includes
> > > > > > > >>>>>>>> which are only available in some engines (e.g. engines
> > > that
> > > > > have
> > > > > > > >>>>>>>> specialized handling of time series data).
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> As some of my personal motivation for the project,
> > > > concurrent
> > > > > > > >>> with
> > > > > > > >>>>> the
> > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python project
> > called
> > > > > Ibis
> > > > > > > >>> [3]
> > > > > > > >>>>>>>> (which is similar to R's dplyr project) which serves
> as
> > a
> > > > > "Python
> > > > > > > >>>>>>>> analytical query IR builder" that is capable of
> > generating
> > > > > most
> > > > > > > >>> of
> > > > > > > >>>>> the
> > > > > > > >>>>>>>> SQL standard, targeting many different SQL dialects
> and
> > > > other
> > > > > > > >>>>> backends
> > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have
> > used
> > > > this
> > > > > > > >>>>> library
> > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to
> be
> > > able
> > > > > to
> > > > > > > >>>>>>>> translate between the in-memory "logical query" data
> > > > > structures
> > > > > > > >>> in
> > > > > > > >>>> a
> > > > > > > >>>>>>>> library like Ibis to a serialized format that can be
> > > > executed
> > > > > by
> > > > > > > >>>> many
> > > > > > > >>>>>>>> different Arrow-native query engines. The expression
> > > > > primitives
> > > > > > > >>>>>>>> available in Ibis should serve as helpful test cases,
> > too.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> I look forward to the community's comments on the RFC
> > > > document
> > > > > > > >>> [1]
> > > > > > > >>>>> and
> > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at
> consensus
> > > on
> > > > a
> > > > > > > >>>> complex
> > > > > > > >>>>>>>> and ambitious project like this can be challenging so
> I
> > > > > recommend
> > > > > > > >>>>>>>> spending time on the "non-goals" section in the RFC
> and
> > > ask
> > > > > > > >>>> questions
> > > > > > > >>>>>>>> if you are unclear about the scope of what problems
> this
> > > is
> > > > > > > >>> trying
> > > > > > > >>>> to
> > > > > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC
> > > > > document
> > > > > > > >>> to
> > > > > > > >>>>>>>> others and would consider ideas about how to move
> > forward
> > > > with
> > > > > > > >>>>>>>> something that is able to be implemented by different
> > > Arrow
> > > > > > > >>>> libraries
> > > > > > > >>>>>>>> in the reasonably near future.
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> Thanks,
> > > > > > > >>>>>>>> Wes
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> [1]:
> > > > > > > >>>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856
> > > > > > > >>>>>>>> [3]: https://ibis-project.org/
> > > > > > > >>>>>>>> [4]:
> > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > > > > > > >>>>>>>> [5]:
> > > > > > > >>>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>
> > > > > > > >>>>>
> > > > > > > >>>>
> > > > > > > >>>
> > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Reply via email to