Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Dimitri Vorona Tue, 10 Aug 2021 05:36:41 -0700

Hi Wes,

cool initiative! Reminded me of "Building Advanced SQL Analytics From
Low-Level Plan Operators" from SIGMOD 2021 (
http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which proposes a
set of building block for advanced aggregation.


Cheers,
Dimitri.

On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:

> Wes,
>
> Thanks for this. I’ve added comments to the doc and to the PR.
>
> The biggest surprise is that this language does full relational
> operations. I was expecting that it would do fragments of the operations.
> Consider join. A distributed hybrid hash join needs to partition rows into
> output buffers based on a hash key, build hash tables, probe into hash
> tables, scan hash tables for untouched “outer”rows, and so forth.
>
> I see Arrow’s compute as delivering each of those operations, working on
> perhaps a batch at a time, or a sequence of batches, with some other system
> coordinating those tasks. So I would expect to see Arrow’s compute language
> mainly operating on batches rather than a table abstraction.
>
> Julian
>
>
> > On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > hi folks,
> >
> > This idea came up in passing in the past -- given that there are
> > multiple independent efforts to develop Arrow-native query engines
> > (and surely many more to come), it seems like it would be valuable to
> > have a way to enable user languages (like Java, Python, R, or Rust,
> > for example) to communicate with backend computing engines (like
> > DataFusion, or new computing capabilities being built in the Arrow C++
> > library) in a fashion that is "lower-level" than SQL and specialized
> > to Arrow's type system. So rather than leaving it to a SQL parser /
> > analyzer framework to generate an expression tree of relational
> > operators and then translate that to an Arrow-native compute-engine's
> > internal grammer, a user framework could provide the desired
> > Arrow-native expression tree / data manipulations directly and skip
> > the SQL altogether.
> >
> > The idea of creating a "serialized intermediate representation (IR)"
> > for Arrow compute operations would be to serve use cases large and
> > small -- from the most complex TPC-* or time series database query to
> > the most simple array predicate/filter sent with an RPC request using
> > Arrow Flight. It is deliberately language- and engine-agnostic, with
> > the only commonality across usages being the Arrow columnar format
> > (schemas and array types). This would be better than leaving it to
> > each application to develop its own bespoke expression representations
> > for its needs.
> >
> > I spent a while thinking about this and wrote up a brain dump RFC
> > document [1] and accompanying pull request [2] that makes the possibly
> > controversial choice of using Flatbuffers to represent the serialized
> > IR. I discuss the rationale for the choice of Flatbuffers in the RFC
> > document. This PR is obviously deficient in many regards (incomplete,
> > hacky, or unclear in places), and will need some help from others to
> > flesh out. I suspect that actually implementing the IR will be
> > necessary to work out many of the low-level details.
> >
> > Note that this IR is intended to be more of a "superset" project than
> > a "lowest common denominator". So there may be things that it includes
> > which are only available in some engines (e.g. engines that have
> > specialized handling of time series data).
> >
> > As some of my personal motivation for the project, concurrent with the
> > genesis of Apache Arrow, I started a Python project called Ibis [3]
> > (which is similar to R's dplyr project) which serves as a "Python
> > analytical query IR builder" that is capable of generating most of the
> > SQL standard, targeting many different SQL dialects and other backends
> > (like pandas). Microsoft ([4]) and Google ([5]) have used this library
> > as a "many-SQL" middleware. As such, I would like to be able to
> > translate between the in-memory "logical query" data structures in a
> > library like Ibis to a serialized format that can be executed by many
> > different Arrow-native query engines. The expression primitives
> > available in Ibis should serve as helpful test cases, too.
> >
> > I look forward to the community's comments on the RFC document [1] and
> > pull request [2] -- I realize that arriving at consensus on a complex
> > and ambitious project like this can be challenging so I recommend
> > spending time on the "non-goals" section in the RFC and ask questions
> > if you are unclear about the scope of what problems this is trying to
> > solve. I would be happy to give Edit access on the RFC document to
> > others and would consider ideas about how to move forward with
> > something that is able to be implemented by different Arrow libraries
> > in the reasonably near future.
> >
> > Thanks,
> > Wes
> >
> > [1]:
> https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#
> > [2]: https://github.com/apache/arrow/pull/10856
> > [3]: https://ibis-project.org/
> > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
> > [5]:
> https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt
>
>

Re: [DISCUSS] Developing an "Arrow Compute IR [Intermediate Representation]" to decouple language front ends from Arrow-native compute engines

Reply via email to