Hi Wes, cool initiative! Reminded me of "Building Advanced SQL Analytics From Low-Level Plan Operators" from SIGMOD 2021 ( http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which proposes a set of building block for advanced aggregation.
Cheers, Dimitri. On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jhyde.apa...@gmail.com> wrote: > Wes, > > Thanks for this. I’ve added comments to the doc and to the PR. > > The biggest surprise is that this language does full relational > operations. I was expecting that it would do fragments of the operations. > Consider join. A distributed hybrid hash join needs to partition rows into > output buffers based on a hash key, build hash tables, probe into hash > tables, scan hash tables for untouched “outer”rows, and so forth. > > I see Arrow’s compute as delivering each of those operations, working on > perhaps a batch at a time, or a sequence of batches, with some other system > coordinating those tasks. So I would expect to see Arrow’s compute language > mainly operating on batches rather than a table abstraction. > > Julian > > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> wrote: > > > > hi folks, > > > > This idea came up in passing in the past -- given that there are > > multiple independent efforts to develop Arrow-native query engines > > (and surely many more to come), it seems like it would be valuable to > > have a way to enable user languages (like Java, Python, R, or Rust, > > for example) to communicate with backend computing engines (like > > DataFusion, or new computing capabilities being built in the Arrow C++ > > library) in a fashion that is "lower-level" than SQL and specialized > > to Arrow's type system. So rather than leaving it to a SQL parser / > > analyzer framework to generate an expression tree of relational > > operators and then translate that to an Arrow-native compute-engine's > > internal grammer, a user framework could provide the desired > > Arrow-native expression tree / data manipulations directly and skip > > the SQL altogether. > > > > The idea of creating a "serialized intermediate representation (IR)" > > for Arrow compute operations would be to serve use cases large and > > small -- from the most complex TPC-* or time series database query to > > the most simple array predicate/filter sent with an RPC request using > > Arrow Flight. It is deliberately language- and engine-agnostic, with > > the only commonality across usages being the Arrow columnar format > > (schemas and array types). This would be better than leaving it to > > each application to develop its own bespoke expression representations > > for its needs. > > > > I spent a while thinking about this and wrote up a brain dump RFC > > document [1] and accompanying pull request [2] that makes the possibly > > controversial choice of using Flatbuffers to represent the serialized > > IR. I discuss the rationale for the choice of Flatbuffers in the RFC > > document. This PR is obviously deficient in many regards (incomplete, > > hacky, or unclear in places), and will need some help from others to > > flesh out. I suspect that actually implementing the IR will be > > necessary to work out many of the low-level details. > > > > Note that this IR is intended to be more of a "superset" project than > > a "lowest common denominator". So there may be things that it includes > > which are only available in some engines (e.g. engines that have > > specialized handling of time series data). > > > > As some of my personal motivation for the project, concurrent with the > > genesis of Apache Arrow, I started a Python project called Ibis [3] > > (which is similar to R's dplyr project) which serves as a "Python > > analytical query IR builder" that is capable of generating most of the > > SQL standard, targeting many different SQL dialects and other backends > > (like pandas). Microsoft ([4]) and Google ([5]) have used this library > > as a "many-SQL" middleware. As such, I would like to be able to > > translate between the in-memory "logical query" data structures in a > > library like Ibis to a serialized format that can be executed by many > > different Arrow-native query engines. The expression primitives > > available in Ibis should serve as helpful test cases, too. > > > > I look forward to the community's comments on the RFC document [1] and > > pull request [2] -- I realize that arriving at consensus on a complex > > and ambitious project like this can be challenging so I recommend > > spending time on the "non-goals" section in the RFC and ask questions > > if you are unclear about the scope of what problems this is trying to > > solve. I would be happy to give Edit access on the RFC document to > > others and would consider ideas about how to move forward with > > something that is able to be implemented by different Arrow libraries > > in the reasonably near future. > > > > Thanks, > > Wes > > > > [1]: > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit# > > [2]: https://github.com/apache/arrow/pull/10856 > > [3]: https://ibis-project.org/ > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf > > [5]: > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt > >