On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão < jorgecarlei...@gmail.com> wrote:
> Couple of questions > > 1. Is the goal that IRs have equal semantics, i.e. given (IR,data), the > operation "(IR,data) - engine -> result" MUST be the same for all "engine"? > I think that might be a non-starter for mundane reasons: there's probably at least two engines that disagree on the result of sum(x) where x is a floating point column. > 2. if yes, imo we may need to worry about: > * a definition of equality that implementations agree on. > * agreement over what the semantics look like. For example, do we use > kleene logic for AND and OR? > WRT Kleene logic my thoughts are that the IR should support both Kleene and non-Kleene, and producers can choose their desired semantics. Ibis for example, would override the `&` operator in `a & b` to produce `KleeneAnd(Column(a), Column(b))`. > > To try to understand the gist, let's pick an aggregated count over a > column: engines often do partial counts over partitions followed by a final > "sum" over the partial counts. Is the idea that the query engine would > communicate with the compute engine via 2 IRs where one is "count me these" > the other is "sum me these"? > > Best, > Jorge > Not in its current incarnation. The idea is that the IR producer communicates a desire to count(x) to a consumer, and it's up to the consumer to figure out how to turn that count into something that makes sense for itself. In your example that's a series of partial counts followed by a sum. > > > > > > On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud <cpcl...@gmail.com> wrote: > > > Thanks Wes, > > > > Great to be back working on Arrow again and engaging with the community. > I > > am really excited about this effort. > > > > I think there are a number of concerns I see as important to address in > the > > compute IR proposal: > > > > 1. Requirement for output types. > > > > I think that so far there's been many reasons for requiring conforming IR > > producers and consumers to adhere to output types, but I haven't seen a > > strong rationale for keeping them optional (in the semantic sense, not > WRT > > any particular serialization format's representation of optionality). > > > > I think a design that includes unambiguous semantics for output types (a > > consumer must produce a value of the requested type or it's an > > error/non-conforming) is simpler to reason about for producers, and > > provides a strong guarantee for end users (humans or machines > constructing > > IR from an API and expecting the thing they ask for back from the IR > > consumer). > > > > 2. Flexibility > > > > The current PR is currently unable to support what I think are killer > > features of the IR: custom operators (relational or column) and UDFs. In > my > > mind, on top of the generalized compute description that the IR offers, > the > > ability for producers and consumers of IR to extend the IR without > needing > > to modify Arrow or depend on anything except the format is itself > something > > that is necessary to gain adoption. > > > > Developers will need to build custom relational operators (e.g., scans of > > backends that don't exist anywhere for which a user has code to > implement) > > and custom functions (anything operating on a column that doesn't already > > exist, really). Furthermore, I think we can actually drive building an > > Arrow consumer using the same API that an end user would use to extend > the > > IR. > > > > 3. Window Functions > > > > Window functions are, I think, an important part of the IR value > > proposition, as they are one of the more complex operators in databases. > I > > think we need to have something in the initial IR proposal to support > these > > operations. > > > > 4. Non relational Joins > > > > Things like as-of join and window join operators aren't yet fleshed out > in > > the IR, and I'm not sure they should be in scope for the initial > prototype. > > I think once we settle on a design, we can work the design of these > > particular operators out during the initial prototype. I think the > > specification of these operators should basically be PR #2 after the > > initial design lands. > > > > # Order of Work > > > > 1. Nail down the design. Anything else is a non-starter. > > > > 2. Prototype an IR producer using Ibis > > > > Ibis is IMO a good candidate for a first IR producer as it has a number > of > > desirable properties that make prototyping faster and allow for us to > > refine the design of the IR as needed based on how the implementation > goes: > > * It's written in Python so it has native support for nearly all of > > flatbuffers' features without having to creating bindings to C++. > > * There's already a set of rules for type checking, as well as APIs for > > constructing expression trees, which means we don't need to worry about > > building a type checker for the prototype. > > > > 3. Prototype an IR consumer in C++ > > > > I think in parallel to the producer prototype we can further inform the > > design from the consumer side by prototyping an IR consumer in C++ . I > know > > Ben Kietzman has expressed interest in working on this. > > > > Very interested to hear others' thoughts. > > > > -Phillip > > > > On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > Thank you for all the feedback and comments on the document. I'm on > > > vacation this week, so I'm delayed responding to everything, but I > > > will get to it as quickly as I can. I will be at VLDB in Copenhagen > > > next week if anyone would like to chat in person about it, and we can > > > relay the content of any discussions back to the document/PR/e-mail > > > thread. > > > > > > I know that Phillip Cloud expressed interest in working on the PR and > > > helping work through many of the details, so I'm glad to have the > > > help. If there are others who would like to work on the PR or dig into > > > the details, please let me know. We might need to figure out how to > > > accommodate "many cooks" by setting up the ComputeIR project somewhere > > > separate from the format/ directory to permit it to exist in a > > > Work-In-Progress status for a period of time until we work through the > > > various details and design concerns. > > > > > > Re Julian's comment > > > > > > > The biggest surprise is that this language does full relational > > > operations. I was expecting that it would do fragments of the > operations. > > > > > > There's a related but different (yet still interesting and worthy of > > > analysis) problem of creating an "engine language" that describes more > > > mechanically the constituent parts of implementing the relational > > > operators. To create a functional computation language with concrete > > > Arrow data structures as a top-level primitive sounds like an > > > interesting research area where I could see something developing > > > eventually. > > > > > > The main problem I'm interested in solving right now is enabling front > > > ends that have sufficient understanding of relational algebra and data > > > frame operations to talk to engines without having to go backwards > > > from their logical query plans to SQL. So as mentioned in the > > > document, being able to faithfully carry the relational operator node > > > information generated by Calcite or Ibis or another system would be > > > super useful. Defining the semantics of various kinds of user-defined > > > functions would also be helpful to standardize the > > > engine-to-user-language UDF/extension interface. > > > > > > On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona <alen...@gmail.com> > > wrote: > > > > > > > > Hi Wes, > > > > > > > > cool initiative! Reminded me of "Building Advanced SQL Analytics From > > > > Low-Level Plan Operators" from SIGMOD 2021 ( > > > > http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) which > > proposes a > > > > set of building block for advanced aggregation. > > > > > > > > Cheers, > > > > Dimitri. > > > > > > > > On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde <jhyde.apa...@gmail.com> > > > wrote: > > > > > > > > > Wes, > > > > > > > > > > Thanks for this. I’ve added comments to the doc and to the PR. > > > > > > > > > > The biggest surprise is that this language does full relational > > > > > operations. I was expecting that it would do fragments of the > > > operations. > > > > > Consider join. A distributed hybrid hash join needs to partition > rows > > > into > > > > > output buffers based on a hash key, build hash tables, probe into > > hash > > > > > tables, scan hash tables for untouched “outer”rows, and so forth. > > > > > > > > > > I see Arrow’s compute as delivering each of those operations, > working > > > on > > > > > perhaps a batch at a time, or a sequence of batches, with some > other > > > system > > > > > coordinating those tasks. So I would expect to see Arrow’s compute > > > language > > > > > mainly operating on batches rather than a table abstraction. > > > > > > > > > > Julian > > > > > > > > > > > > > > > > On Aug 2, 2021, at 5:16 PM, Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > > > > > > > > > hi folks, > > > > > > > > > > > > This idea came up in passing in the past -- given that there are > > > > > > multiple independent efforts to develop Arrow-native query > engines > > > > > > (and surely many more to come), it seems like it would be > valuable > > to > > > > > > have a way to enable user languages (like Java, Python, R, or > Rust, > > > > > > for example) to communicate with backend computing engines (like > > > > > > DataFusion, or new computing capabilities being built in the > Arrow > > > C++ > > > > > > library) in a fashion that is "lower-level" than SQL and > > specialized > > > > > > to Arrow's type system. So rather than leaving it to a SQL > parser / > > > > > > analyzer framework to generate an expression tree of relational > > > > > > operators and then translate that to an Arrow-native > > compute-engine's > > > > > > internal grammer, a user framework could provide the desired > > > > > > Arrow-native expression tree / data manipulations directly and > skip > > > > > > the SQL altogether. > > > > > > > > > > > > The idea of creating a "serialized intermediate representation > > (IR)" > > > > > > for Arrow compute operations would be to serve use cases large > and > > > > > > small -- from the most complex TPC-* or time series database > query > > to > > > > > > the most simple array predicate/filter sent with an RPC request > > using > > > > > > Arrow Flight. It is deliberately language- and engine-agnostic, > > with > > > > > > the only commonality across usages being the Arrow columnar > format > > > > > > (schemas and array types). This would be better than leaving it > to > > > > > > each application to develop its own bespoke expression > > > representations > > > > > > for its needs. > > > > > > > > > > > > I spent a while thinking about this and wrote up a brain dump RFC > > > > > > document [1] and accompanying pull request [2] that makes the > > > possibly > > > > > > controversial choice of using Flatbuffers to represent the > > serialized > > > > > > IR. I discuss the rationale for the choice of Flatbuffers in the > > RFC > > > > > > document. This PR is obviously deficient in many regards > > (incomplete, > > > > > > hacky, or unclear in places), and will need some help from others > > to > > > > > > flesh out. I suspect that actually implementing the IR will be > > > > > > necessary to work out many of the low-level details. > > > > > > > > > > > > Note that this IR is intended to be more of a "superset" project > > than > > > > > > a "lowest common denominator". So there may be things that it > > > includes > > > > > > which are only available in some engines (e.g. engines that have > > > > > > specialized handling of time series data). > > > > > > > > > > > > As some of my personal motivation for the project, concurrent > with > > > the > > > > > > genesis of Apache Arrow, I started a Python project called Ibis > [3] > > > > > > (which is similar to R's dplyr project) which serves as a "Python > > > > > > analytical query IR builder" that is capable of generating most > of > > > the > > > > > > SQL standard, targeting many different SQL dialects and other > > > backends > > > > > > (like pandas). Microsoft ([4]) and Google ([5]) have used this > > > library > > > > > > as a "many-SQL" middleware. As such, I would like to be able to > > > > > > translate between the in-memory "logical query" data structures > in > > a > > > > > > library like Ibis to a serialized format that can be executed by > > many > > > > > > different Arrow-native query engines. The expression primitives > > > > > > available in Ibis should serve as helpful test cases, too. > > > > > > > > > > > > I look forward to the community's comments on the RFC document > [1] > > > and > > > > > > pull request [2] -- I realize that arriving at consensus on a > > complex > > > > > > and ambitious project like this can be challenging so I recommend > > > > > > spending time on the "non-goals" section in the RFC and ask > > questions > > > > > > if you are unclear about the scope of what problems this is > trying > > to > > > > > > solve. I would be happy to give Edit access on the RFC document > to > > > > > > others and would consider ideas about how to move forward with > > > > > > something that is able to be implemented by different Arrow > > libraries > > > > > > in the reasonably near future. > > > > > > > > > > > > Thanks, > > > > > > Wes > > > > > > > > > > > > [1]: > > > > > > > > > > > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit# > > > > > > [2]: https://github.com/apache/arrow/pull/10856 > > > > > > [3]: https://ibis-project.org/ > > > > > > [4]: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf > > > > > > [5]: > > > > > > > > > > > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt > > > > > > > > > > > > > > > >