Hey everyone, There's some interesting discussion around types and where their location is in the current PR [1] (and in fact whether to store them at all).
It would be great to get some community feedback on this [2] part of the PR in particular, because the choice of whether to store types at all has important design consequences. [1]: https://github.com/apache/arrow/pull/10934 [2]: https://github.com/apache/arrow/pull/10934/files#r697025313 On Fri, Aug 27, 2021 at 2:11 AM Micah Kornfield <emkornfi...@gmail.com> wrote: > As an FYI, Iceberg is also considering an IR in relation to view support > [1]. I chimed in and pointed them to this thread and Wes's doc. Phillip > and Jacques chimed in there as well. > > [1] > > https://mail-archives.apache.org/mod_mbox/iceberg-dev/202108.mbox/%3CCAKRVfm6h6WxQtp5fj8Yj8XWR1wFe8VohOkPuoZZGK-UHPhtwjQ%40mail.gmail.com%3E > > On Thu, Aug 26, 2021 at 12:40 PM Phillip Cloud <cpcl...@gmail.com> wrote: > > > Thanks for the feedback Jacques, very helpful. In the latest version of > the > > PR, I've tried to incorporate nearly all of these points. > > > > - I've incorporated most of what you had for dereferencing operations > into > > the PR, and gotten rid of schemas except on Read/Write relations. > > - With respect to properties, I've made a bunch more specific operators, > > and kept user-defined things special case-y. > > - I haven't incorporated anything close to physical-plan things, but I > > think that's a good follow up PR. Having separate representations for > > logical/physical plans seems like a waste of effort ultimately. I think > we > > can find a good balance. > > - Agree on UDF support, I think that will have to evolve as the rest of > the > > spec evolves. UDFs will need language-dedicated effort given the large > > variety of languages that folks will want to use to define functions. > > > > On a separate note, in an effort to move this project forward I'd like to > > do one final round of code review from anyone interested and then merge > the > > PR after that. > > This spec will be unstable for a while, until we can work out all the > > design kinks and edge cases, and I think getting this initial spec in is > > important to start that process. > > > > > > On Mon, Aug 23, 2021 at 1:53 PM Jacques Nadeau <jacq...@apache.org> > wrote: > > > > > In a lucky turn of events, Phillip actually turned out to be in my neck > > of > > > the woods on Friday so we had a chance to sit down and discuss this. To > > > help, I actually shared something I had been working on a few months > ago > > > independently (before this discussion started). > > > > > > For reference: > > > Wes PR: https://github.com/apache/arrow/pull/10856 > > > Ben PR: https://github.com/apache/arrow/pull/10934 > > > Jacques PR: https://github.com/apache/arrow/pull/10979 > > > > > > The high level points of feedback I have are: > > > > > > - Ben PR feels too deconstructed. While I like the elegance and > > > symmetry, I believe this will lead to substantially more work in > > moving > > > from serialization format to something closer to what a system would > > > want > > > to manipulate/consume. The reality is that there are a lot of really > > > known > > > things and specializing the representation for these things will > > > ultimately > > > make things easier to program with without error and easier to > debug. > > > (For > > > example, imagine trying to inspect a plan in a debugging session > with > > > the > > > Ben representation.) We should embrace the known things in the > > > specification. > > > - I believe that it is a mistake for the inner workings of the plan > to > > > ever use field names. Only input (e.g. file read) and Output (e.g. > > > return > > > to user or write to file) need to have field names. For the rest of > > the > > > system, using field ordinals (determinant whether nested or flat) is > > > much > > > cleaner and is how most execution systems work. For example, in > > Impala I > > > believe it is called a slot. As I noted in the PR, Calcite as an > > > example is > > > entirely ordinal based at the algebra level. Rowtypes contain field > > > names > > > but they are actually basically pointless. Field references use > > > RexInputRef > > > with ordinal based and rules around column order output (e.g. what > is > > > the > > > field order of a join output) are determinant and done entirely at > an > > > ordinal level. The only place where names should be used (besides > > > input/output) is in the case of map keys. In that case, the names > are > > > actually data, as opposed to scheme metadata. This is why I propose > a > > > strongly structured dereference operation [1]. > > > - Properties should only be included in the serialization if they > are > > > not easily re-derivable at plan consumption time. For example, > you'll > > > note > > > that I don't store schema information for relational operation. Each > > > function and relational operation should already know how a given > > input > > > is > > > transformed to a given output. Capturing this information in the > > > plan/IR is > > > excessive. In many ways, I compare it to the early use of > VectorLayout > > > [2] > > > in Arrow schema. It may have provided some additional checksum of > the > > > message but ultimately it caused more pain than it was worth (and > thus > > > we > > > removed it before formalizing the specification). For reference, in > > the > > > context of Dremio, we used to actually do this, send schema > > information > > > around for all operations. We removed it because in many cases > > becoming > > > the > > > majority of our internal plan serialization (imagine simple > operations > > > that > > > are carrying 1000s of fields). > > > - I suggest focusing on support for both logical and physical > > > representations. The moment you start talking about optimization > > passes, > > > many of those would probably be better being done at the logical > > level. > > > The > > > overlap is really high. > > > - I think a lot more work must be done before introducing UDFs and > > user > > > defined relational operations. I see one goal being the possibility > of > > > there being three systems: A -> B -> C. A is a IR producer. C is a > IR > > > consumer and B is a IR filter or translator. In this situation, B > > > should be > > > able to operate and do optimizations on a plan even if if there are > > > black > > > box user defined operations. Being able to know the > > > properties-preservation > > > or not of these operations is important to making decisions. For > > > example, > > > does a user defined relational operation maintain sortedness? > > > Distribution? > > > Is a defined UDF idempotent? As such, I think the definition of > those > > > black > > > boxes should be much more structured. For example: it is a python > > > relational operation named X stored in Y that maintains properties > 1,2 > > > and > > > disrupts property 3. Putting just a black box of bytes will > > > substantially > > > reduce the compatibility and extensibility of the ecosystem of tools > > > working against IR. I'd note that I wouldn't expect this to be a > > burden > > > to > > > actual end users. By using sensible defaults, I still would expect > an > > > end > > > user tool to support arbitrary user defined operations. > > > - It might make sense to review the XML representation that Orca > uses > > > [3]. I haven't looked at it recently but they had a strong goal of > > > decoupling for most of its life (to use in both Greenplum and Hawq). > > It > > > could be the most mature/formal serialization of query plans > > publically > > > available. > > > > > > > > > [1] > > > > > > > > > https://github.com/apache/arrow/pull/10979/files#diff-e40fbc40cf7a131efd2cb098444931774cfad046b8665b38452258ffaa2e3423R34 > > > [2] > > > > > > > > > https://github.com/apache/arrow/commit/611a4b951e24f4f967c3d382a2027dc035fc37f0 > > > [3] https://github.com/greenplum-db/gporca > > > > > > > > > On Tue, Aug 17, 2021 at 11:14 AM Phillip Cloud <cpcl...@gmail.com> > > wrote: > > > > > > > On Tue, Aug 17, 2021 at 10:56 AM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > > > > > > Looking at Ben's alternate PR [1], having an IR that leans heavily > on > > > > > memory references to an out-of-band data sidecar seems like an > > > > > approach that would substantially ratchet up the implementation > > > > > complexity as producing the IR would then have the level of > > complexity > > > > > of producing the Arrow IPC format — when producing the "root" Plan > > > > > message, you must accumulate a list of the dependent serialized > > > > > submessages, placing the appropriate Buffer memory offset in the > Plan > > > > > message, like we do when producing the RecordBatch.buffers field. > > This > > > > > seems complicated to me as you must devise a custom binary protocol > > to > > > > > concatenate the serialized Plan and the messages it depends on > into a > > > > > single binary payload > > > > > > > > > > <ROOT PLAN> > > > > > <padding> > > > > > <Buffer 0> > > > > > <padding> > > > > > <Buffer 1> > > > > > <padding> > > > > > ... > > > > > <Buffer N - 1> > > > > > > > > > > (one purpose of FlatBufferBuilder is to spare you having to do this > > > > > yourself — some reasons we do it for the Arrow IPC format is > because > > > > > appending Arrow memory buffers directly to a FlatBufferBuilder > would > > > > > be inefficient — internal realloc calls — and Flatbuffers are > limited > > > > > to 2GB. Neither of these things are problems here) > > > > > > > > > > In general, I believe the output created by an IR producer should > be > > a > > > > > single serialized object without any out-of-band data sidecar — > this > > > > > is much simpler for implementers and we can still provide an > "escape > > > > > hatch" for user-defined operators and functions where the required > > > > > function/operator is passed opaquely as an embedded binary data. > > > > > > > > > > > > > > > > The serialization format (whether it is Flatbuffers or JSON, or > > > > > something else) should allow for data memoization, so if there is a > > > > > user-defined operator/function, or a relation that is used multiple > > > > > times throughout the query (potentially with a large schema), then > we > > > > > should ensure that the data need not be duplicated in the > > > > > serialization format unnecessarily — in Flatbuffers, you can > achieve > > > > > this by reusing offsets, but we could devise the data structures to > > > > > make the memoization of "expensive" objects more explicit. > > > > > > > > > > > > > I think this is something that would need to be explicitly encoded in > > > > the structures themselves if it's a hard requirement. I don't think > > this > > > > should block > > > > a prototype producer/consumer. > > > > > > > > Is there something in the second PR/design that precludes the reuse > of > > > > offsets? > > > > To my eye, the flatbuffers offset reuse mechanism works just as well > > > there. > > > > > > > > > > > > > I additionally think that it is important to provide as much > built-in > > > > > support for "canonical" operators/functions (such as the ones > > > > > implemented commonly by SQL engines) as possible, and to liberally > > > > > expand the catalogue of "built-in" capabilities. I would still > prefer > > > > > to have large unions/enums of built-in operators/functions and to > > > > > expand those unions/enums to accommodate new things when it is > > > > > demonstrated that there is a need to standardize things between > > > > > producers/consumers. > > > > > > > > > > > > > I think there's a middle ground where we add a bit of structure > > > (something > > > > like > > > > a descriptor from the first PR) to indicate whether a thing is > built-in > > > vs > > > > user-defined. > > > > It looks like Ben has pushed something like this to his PR. > > > > > > > > With that scheme, we have both flexibility and a small set of special > > > > builtins that make up > > > > a statically typed set for expressions and relational operators. > > > > > > > > I would really like to vet this PR with a prototype this week, > > > > to see whether we need to revisit any major choices. I don't think > > we'll > > > be > > > > able to > > > > anticipate all the consequences until we write some code. > > > > > > > > > > > > > > > > > > One of the beneficial properties of the Union/Enum approach for the > > > > > operator/function catalogues, is that when there are additions to > > > > > those enums, the generated Flatbuffers files will cause many > language > > > > > compilers to warn or error on unhandled enum cases. If all > > > > > function/operator names are strings, then you are essentially > > > > > reimplementing the functionality provided by enums by hand. I > > > > > initially used strings for all function references in my original > > > > > prototype, but I now think that using an enum for "built-ins" would > > be > > > > > superior (because of the code-generated enum interface) and not a > > > > > premature optimization. > > > > > > > > > > [1]: https://github.com/apache/arrow/pull/10934 > > > > > > > > > > On Fri, Aug 13, 2021 at 11:26 PM Phillip Cloud <cpcl...@gmail.com> > > > > wrote: > > > > > > > > > > > > Hey all, > > > > > > > > > > > > Just wanted to give an update on the effort here. > > > > > > > > > > > > Ben Kietzman has created an alternative proposal to the initial > > > design > > > > > [1]. > > > > > > It largely overlaps with the original, but differs in a few > > important > > > > > ways: > > > > > > > > > > > > * A big focus of the design is on flexibility, allowing > producers, > > > > > > consumers and ultimately end users of those systems the ability > to > > > > define > > > > > > custom operations in the graph. > > > > > > * There are very few predefined relational operations (project, > > > filter, > > > > > > join and a handful of others). > > > > > > * There are only 3 types of value expressions: literals, field > > > > > references, > > > > > > and function calls. > > > > > > * The model of evaluation is one that requires a final sink node, > > to > > > > > > indicate where the record batches from child relations end up (a > > > file, > > > > a > > > > > > socket, an in-memory buffer, etc). > > > > > > > > > > > > I've added notes [2] to the original Google doc (under the > > > Alternative > > > > > > Design Notes subheading), and a few pseudocode examples. > > > > > > Unfortunately, these went out of date as soon as Ben pushed the > PR > > > [3], > > > > > so > > > > > > I need to update those to reflect his changes. Regardless, > > > > > > the design is broadly the same, so it should still give a good > > > > indication > > > > > > of the details of the design. > > > > > > > > > > > > There are a decent number of review comments on the original PR > > that > > > I > > > > > plan > > > > > > to port over where they are still relevant. > > > > > > I also plan on adding support for window functions either tonight > > or > > > on > > > > > > Monday. > > > > > > > > > > > > Please review this design at your earliest convenience. Since > > > there's a > > > > > > fairly concrete set of types in flatbuffers that > > > > > > we can look at, ideally we can center discussion around the > details > > > in > > > > > the > > > > > > PR. > > > > > > > > > > > > Thanks! > > > > > > > > > > > > [1]: https://github.com/apache/arrow/pull/10856 > > > > > > [2]: > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.4tfbbtaqzu13 > > > > > > [3]: https://github.com/apache/arrow/pull/10934 > > > > > > > > > > > > On Thu, Aug 12, 2021 at 3:55 PM Julian Hyde < > > jhyde.apa...@gmail.com> > > > > > wrote: > > > > > > > > > > > > > > Wes wrote: > > > > > > > > > > > > > > > > Supporting this kind of intra-application engine > > > > > > > > heterogeneity is one of the motivations for the project. > > > > > > > > > > > > > > +1 > > > > > > > > > > > > > > The data format is the natural interface between tasks. > (Defining > > > > > “task” > > > > > > > here as “something that is programmed using the IR”.) That is > > > Arrow’s > > > > > > > strength. > > > > > > > > > > > > > > So I think the IR should describe what each task should do, and > > > tasks > > > > > > > should be fairly small. Not whole relational operators, > operating > > > on > > > > > whole > > > > > > > tables, but pieces of relational operators, operating on > batches > > or > > > > > > > sequences of batches. > > > > > > > > > > > > > > Elsethread, someone mentioned the LoLePop concept and the > > > > > Kohn/Leis/Neuman > > > > > > > paper [1]. The LoLePop concept sounds good for our purposes. > > > > > > > > > > > > > > Julian > > > > > > > > > > > > > > [1] https://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf > > > > > > > > > > > > > > > > > > > > > > On Aug 12, 2021, at 5:19 AM, Wes McKinney < > wesmck...@gmail.com > > > > > > > > wrote: > > > > > > > > > > > > > > > > On Wed, Aug 11, 2021 at 11:22 PM Phillip Cloud < > > > cpcl...@gmail.com> > > > > > > > wrote: > > > > > > > >> > > > > > > > >> On Wed, Aug 11, 2021 at 4:48 PM Jorge Cardoso Leitão < > > > > > > > >> jorgecarlei...@gmail.com> wrote: > > > > > > > >> > > > > > > > >>> Couple of questions > > > > > > > >>> > > > > > > > >>> 1. Is the goal that IRs have equal semantics, i.e. given > > > > > (IR,data), the > > > > > > > >>> operation "(IR,data) - engine -> result" MUST be the same > for > > > all > > > > > > > "engine"? > > > > > > > >>> > > > > > > > >> > > > > > > > >> I think that might be a non-starter for mundane reasons: > > there's > > > > > > > probably > > > > > > > >> at least two engines > > > > > > > >> that disagree on the result of sum(x) where x is a floating > > > point > > > > > > > column. > > > > > > > >> > > > > > > > >> > > > > > > > >>> 2. if yes, imo we may need to worry about: > > > > > > > >>> * a definition of equality that implementations agree on. > > > > > > > >>> * agreement over what the semantics look like. For example, > > do > > > we > > > > > use > > > > > > > >>> kleene logic for AND and OR? > > > > > > > >>> > > > > > > > >> > > > > > > > >> WRT Kleene logic my thoughts are that the IR should support > > both > > > > > Kleene > > > > > > > and > > > > > > > >> non-Kleene, > > > > > > > >> and producers can choose their desired semantics. > > > > > > > >> > > > > > > > >> Ibis for example, would override the `&` operator in `a & b` > > to > > > > > produce > > > > > > > >> `KleeneAnd(Column(a), Column(b))`. > > > > > > > >> > > > > > > > >> > > > > > > > >>> > > > > > > > >>> To try to understand the gist, let's pick an aggregated > count > > > > over > > > > > a > > > > > > > >>> column: engines often do partial counts over partitions > > > followed > > > > > by a > > > > > > > final > > > > > > > >>> "sum" over the partial counts. Is the idea that the query > > > engine > > > > > would > > > > > > > >>> communicate with the compute engine via 2 IRs where one is > > > "count > > > > > me > > > > > > > these" > > > > > > > >>> the other is "sum me these"? > > > > > > > >>> > > > > > > > >>> Best, > > > > > > > >>> Jorge > > > > > > > >>> > > > > > > > >> > > > > > > > >> Not in its current incarnation. > > > > > > > >> > > > > > > > >> The idea is that the IR producer communicates a desire to > > > count(x) > > > > > to a > > > > > > > >> consumer, and it's up to the consumer to figure out how to > > turn > > > > > that > > > > > > > count > > > > > > > >> into something that makes sense for itself. In your example > > > > that's a > > > > > > > series > > > > > > > >> of partial counts followed by a sum. > > > > > > > >> > > > > > > > > > > > > > > > > That said, I think there is a valid use case here where a > > system > > > > > might > > > > > > > > make use of different engines to execute different > (composable) > > > > > layers > > > > > > > > of a complex query. > > > > > > > > > > > > > > > > For example: > > > > > > > > > > > > > > > > * suppose you want to scan and do predicate pushdown on an > > > unusual > > > > > > > > data source that is only accessible from one particular > engine > > > but > > > > > > > > * you need to do some analytical operation with the scan > > results > > > > that > > > > > > > > is only supported by another engine > > > > > > > > > > > > > > > > You could decompose the query into two stages with an IR > > > relational > > > > > > > > expression for each stage and use then the engines together > to > > > > > > > > accomplish what you need (of course, you would need an > > > > orchestration > > > > > > > > layer to handle plumbing the query engine inputs and outputs > > > > together > > > > > > > > as Arrow streams). Supporting this kind of intra-application > > > engine > > > > > > > > heterogeneity is one of the motivations for the project. > > > > > > > > > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> > > > > > > > >>> On Wed, Aug 11, 2021 at 6:10 PM Phillip Cloud < > > > cpcl...@gmail.com > > > > > > > > > > > > wrote: > > > > > > > >>> > > > > > > > >>>> Thanks Wes, > > > > > > > >>>> > > > > > > > >>>> Great to be back working on Arrow again and engaging with > > the > > > > > > > community. > > > > > > > >>> I > > > > > > > >>>> am really excited about this effort. > > > > > > > >>>> > > > > > > > >>>> I think there are a number of concerns I see as important > to > > > > > address > > > > > > > in > > > > > > > >>> the > > > > > > > >>>> compute IR proposal: > > > > > > > >>>> > > > > > > > >>>> 1. Requirement for output types. > > > > > > > >>>> > > > > > > > >>>> I think that so far there's been many reasons for > requiring > > > > > > > conforming IR > > > > > > > >>>> producers and consumers to adhere to output types, but I > > > haven't > > > > > seen > > > > > > > a > > > > > > > >>>> strong rationale for keeping them optional (in the > semantic > > > > > sense, not > > > > > > > >>> WRT > > > > > > > >>>> any particular serialization format's representation of > > > > > optionality). > > > > > > > >>>> > > > > > > > >>>> I think a design that includes unambiguous semantics for > > > output > > > > > types > > > > > > > (a > > > > > > > >>>> consumer must produce a value of the requested type or > it's > > an > > > > > > > >>>> error/non-conforming) is simpler to reason about for > > > producers, > > > > > and > > > > > > > >>>> provides a strong guarantee for end users (humans or > > machines > > > > > > > >>> constructing > > > > > > > >>>> IR from an API and expecting the thing they ask for back > > from > > > > the > > > > > IR > > > > > > > >>>> consumer). > > > > > > > >>>> > > > > > > > >>>> 2. Flexibility > > > > > > > >>>> > > > > > > > >>>> The current PR is currently unable to support what I think > > are > > > > > killer > > > > > > > >>>> features of the IR: custom operators (relational or > column) > > > and > > > > > UDFs. > > > > > > > In > > > > > > > >>> my > > > > > > > >>>> mind, on top of the generalized compute description that > the > > > IR > > > > > > > offers, > > > > > > > >>> the > > > > > > > >>>> ability for producers and consumers of IR to extend the IR > > > > without > > > > > > > >>> needing > > > > > > > >>>> to modify Arrow or depend on anything except the format is > > > > itself > > > > > > > >>> something > > > > > > > >>>> that is necessary to gain adoption. > > > > > > > >>>> > > > > > > > >>>> Developers will need to build custom relational operators > > > (e.g., > > > > > > > scans of > > > > > > > >>>> backends that don't exist anywhere for which a user has > code > > > to > > > > > > > >>> implement) > > > > > > > >>>> and custom functions (anything operating on a column that > > > > doesn't > > > > > > > already > > > > > > > >>>> exist, really). Furthermore, I think we can actually drive > > > > > building an > > > > > > > >>>> Arrow consumer using the same API that an end user would > use > > > to > > > > > extend > > > > > > > >>> the > > > > > > > >>>> IR. > > > > > > > >>>> > > > > > > > >>>> 3. Window Functions > > > > > > > >>>> > > > > > > > >>>> Window functions are, I think, an important part of the IR > > > value > > > > > > > >>>> proposition, as they are one of the more complex operators > > in > > > > > > > databases. > > > > > > > >>> I > > > > > > > >>>> think we need to have something in the initial IR proposal > > to > > > > > support > > > > > > > >>> these > > > > > > > >>>> operations. > > > > > > > >>>> > > > > > > > >>>> 4. Non relational Joins > > > > > > > >>>> > > > > > > > >>>> Things like as-of join and window join operators aren't > yet > > > > > fleshed > > > > > > > out > > > > > > > >>> in > > > > > > > >>>> the IR, and I'm not sure they should be in scope for the > > > initial > > > > > > > >>> prototype. > > > > > > > >>>> I think once we settle on a design, we can work the design > > of > > > > > these > > > > > > > >>>> particular operators out during the initial prototype. I > > think > > > > the > > > > > > > >>>> specification of these operators should basically be PR #2 > > > after > > > > > the > > > > > > > >>>> initial design lands. > > > > > > > >>>> > > > > > > > >>>> # Order of Work > > > > > > > >>>> > > > > > > > >>>> 1. Nail down the design. Anything else is a non-starter. > > > > > > > >>>> > > > > > > > >>>> 2. Prototype an IR producer using Ibis > > > > > > > >>>> > > > > > > > >>>> Ibis is IMO a good candidate for a first IR producer as it > > > has a > > > > > > > number > > > > > > > >>> of > > > > > > > >>>> desirable properties that make prototyping faster and > allow > > > for > > > > > us to > > > > > > > >>>> refine the design of the IR as needed based on how the > > > > > implementation > > > > > > > >>> goes: > > > > > > > >>>> * It's written in Python so it has native support for > nearly > > > all > > > > > of > > > > > > > >>>> flatbuffers' features without having to creating bindings > to > > > > C++. > > > > > > > >>>> * There's already a set of rules for type checking, as > well > > as > > > > > APIs > > > > > > > for > > > > > > > >>>> constructing expression trees, which means we don't need > to > > > > worry > > > > > > > about > > > > > > > >>>> building a type checker for the prototype. > > > > > > > >>>> > > > > > > > >>>> 3. Prototype an IR consumer in C++ > > > > > > > >>>> > > > > > > > >>>> I think in parallel to the producer prototype we can > further > > > > > inform > > > > > > > the > > > > > > > >>>> design from the consumer side by prototyping an IR > consumer > > in > > > > > C++ . I > > > > > > > >>> know > > > > > > > >>>> Ben Kietzman has expressed interest in working on this. > > > > > > > >>>> > > > > > > > >>>> Very interested to hear others' thoughts. > > > > > > > >>>> > > > > > > > >>>> -Phillip > > > > > > > >>>> > > > > > > > >>>> On Tue, Aug 10, 2021 at 10:56 AM Wes McKinney < > > > > > wesmck...@gmail.com> > > > > > > > >>> wrote: > > > > > > > >>>> > > > > > > > >>>>> Thank you for all the feedback and comments on the > > document. > > > > I'm > > > > > on > > > > > > > >>>>> vacation this week, so I'm delayed responding to > > everything, > > > > but > > > > > I > > > > > > > >>>>> will get to it as quickly as I can. I will be at VLDB in > > > > > Copenhagen > > > > > > > >>>>> next week if anyone would like to chat in person about > it, > > > and > > > > > we can > > > > > > > >>>>> relay the content of any discussions back to the > > > > > document/PR/e-mail > > > > > > > >>>>> thread. > > > > > > > >>>>> > > > > > > > >>>>> I know that Phillip Cloud expressed interest in working > on > > > the > > > > > PR and > > > > > > > >>>>> helping work through many of the details, so I'm glad to > > have > > > > the > > > > > > > >>>>> help. If there are others who would like to work on the > PR > > or > > > > dig > > > > > > > into > > > > > > > >>>>> the details, please let me know. We might need to figure > > out > > > > how > > > > > to > > > > > > > >>>>> accommodate "many cooks" by setting up the ComputeIR > > project > > > > > > > somewhere > > > > > > > >>>>> separate from the format/ directory to permit it to exist > > in > > > a > > > > > > > >>>>> Work-In-Progress status for a period of time until we > work > > > > > through > > > > > > > the > > > > > > > >>>>> various details and design concerns. > > > > > > > >>>>> > > > > > > > >>>>> Re Julian's comment > > > > > > > >>>>> > > > > > > > >>>>>> The biggest surprise is that this language does full > > > > relational > > > > > > > >>>>> operations. I was expecting that it would do fragments of > > the > > > > > > > >>> operations. > > > > > > > >>>>> > > > > > > > >>>>> There's a related but different (yet still interesting > and > > > > > worthy of > > > > > > > >>>>> analysis) problem of creating an "engine language" that > > > > describes > > > > > > > more > > > > > > > >>>>> mechanically the constituent parts of implementing the > > > > relational > > > > > > > >>>>> operators. To create a functional computation language > with > > > > > concrete > > > > > > > >>>>> Arrow data structures as a top-level primitive sounds > like > > an > > > > > > > >>>>> interesting research area where I could see something > > > > developing > > > > > > > >>>>> eventually. > > > > > > > >>>>> > > > > > > > >>>>> The main problem I'm interested in solving right now is > > > > enabling > > > > > > > front > > > > > > > >>>>> ends that have sufficient understanding of relational > > algebra > > > > and > > > > > > > data > > > > > > > >>>>> frame operations to talk to engines without having to go > > > > > backwards > > > > > > > >>>>> from their logical query plans to SQL. So as mentioned in > > the > > > > > > > >>>>> document, being able to faithfully carry the relational > > > > operator > > > > > node > > > > > > > >>>>> information generated by Calcite or Ibis or another > system > > > > would > > > > > be > > > > > > > >>>>> super useful. Defining the semantics of various kinds of > > > > > user-defined > > > > > > > >>>>> functions would also be helpful to standardize the > > > > > > > >>>>> engine-to-user-language UDF/extension interface. > > > > > > > >>>>> > > > > > > > >>>>> On Tue, Aug 10, 2021 at 2:36 PM Dimitri Vorona < > > > > > alen...@gmail.com> > > > > > > > >>>> wrote: > > > > > > > >>>>>> > > > > > > > >>>>>> Hi Wes, > > > > > > > >>>>>> > > > > > > > >>>>>> cool initiative! Reminded me of "Building Advanced SQL > > > > Analytics > > > > > > > From > > > > > > > >>>>>> Low-Level Plan Operators" from SIGMOD 2021 ( > > > > > > > >>>>>> http://db.in.tum.de/~kohn/papers/lolepops-sigmod21.pdf) > > > which > > > > > > > >>>> proposes a > > > > > > > >>>>>> set of building block for advanced aggregation. > > > > > > > >>>>>> > > > > > > > >>>>>> Cheers, > > > > > > > >>>>>> Dimitri. > > > > > > > >>>>>> > > > > > > > >>>>>> On Thu, Aug 5, 2021 at 7:59 PM Julian Hyde < > > > > > jhyde.apa...@gmail.com> > > > > > > > >>>>> wrote: > > > > > > > >>>>>> > > > > > > > >>>>>>> Wes, > > > > > > > >>>>>>> > > > > > > > >>>>>>> Thanks for this. I’ve added comments to the doc and to > > the > > > > PR. > > > > > > > >>>>>>> > > > > > > > >>>>>>> The biggest surprise is that this language does full > > > > relational > > > > > > > >>>>>>> operations. I was expecting that it would do fragments > of > > > the > > > > > > > >>>>> operations. > > > > > > > >>>>>>> Consider join. A distributed hybrid hash join needs to > > > > > partition > > > > > > > >>> rows > > > > > > > >>>>> into > > > > > > > >>>>>>> output buffers based on a hash key, build hash tables, > > > probe > > > > > into > > > > > > > >>>> hash > > > > > > > >>>>>>> tables, scan hash tables for untouched “outer”rows, and > > so > > > > > forth. > > > > > > > >>>>>>> > > > > > > > >>>>>>> I see Arrow’s compute as delivering each of those > > > operations, > > > > > > > >>> working > > > > > > > >>>>> on > > > > > > > >>>>>>> perhaps a batch at a time, or a sequence of batches, > with > > > > some > > > > > > > >>> other > > > > > > > >>>>> system > > > > > > > >>>>>>> coordinating those tasks. So I would expect to see > > Arrow’s > > > > > compute > > > > > > > >>>>> language > > > > > > > >>>>>>> mainly operating on batches rather than a table > > > abstraction. > > > > > > > >>>>>>> > > > > > > > >>>>>>> Julian > > > > > > > >>>>>>> > > > > > > > >>>>>>> > > > > > > > >>>>>>>> On Aug 2, 2021, at 5:16 PM, Wes McKinney < > > > > wesmck...@gmail.com > > > > > > > > > > > > > >>>>> wrote: > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> hi folks, > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> This idea came up in passing in the past -- given that > > > there > > > > > are > > > > > > > >>>>>>>> multiple independent efforts to develop Arrow-native > > query > > > > > > > >>> engines > > > > > > > >>>>>>>> (and surely many more to come), it seems like it would > > be > > > > > > > >>> valuable > > > > > > > >>>> to > > > > > > > >>>>>>>> have a way to enable user languages (like Java, > Python, > > R, > > > > or > > > > > > > >>> Rust, > > > > > > > >>>>>>>> for example) to communicate with backend computing > > engines > > > > > (like > > > > > > > >>>>>>>> DataFusion, or new computing capabilities being built > in > > > the > > > > > > > >>> Arrow > > > > > > > >>>>> C++ > > > > > > > >>>>>>>> library) in a fashion that is "lower-level" than SQL > and > > > > > > > >>>> specialized > > > > > > > >>>>>>>> to Arrow's type system. So rather than leaving it to a > > SQL > > > > > > > >>> parser / > > > > > > > >>>>>>>> analyzer framework to generate an expression tree of > > > > > relational > > > > > > > >>>>>>>> operators and then translate that to an Arrow-native > > > > > > > >>>> compute-engine's > > > > > > > >>>>>>>> internal grammer, a user framework could provide the > > > desired > > > > > > > >>>>>>>> Arrow-native expression tree / data manipulations > > directly > > > > and > > > > > > > >>> skip > > > > > > > >>>>>>>> the SQL altogether. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> The idea of creating a "serialized intermediate > > > > representation > > > > > > > >>>> (IR)" > > > > > > > >>>>>>>> for Arrow compute operations would be to serve use > cases > > > > large > > > > > > > >>> and > > > > > > > >>>>>>>> small -- from the most complex TPC-* or time series > > > database > > > > > > > >>> query > > > > > > > >>>> to > > > > > > > >>>>>>>> the most simple array predicate/filter sent with an > RPC > > > > > request > > > > > > > >>>> using > > > > > > > >>>>>>>> Arrow Flight. It is deliberately language- and > > > > > engine-agnostic, > > > > > > > >>>> with > > > > > > > >>>>>>>> the only commonality across usages being the Arrow > > > columnar > > > > > > > >>> format > > > > > > > >>>>>>>> (schemas and array types). This would be better than > > > leaving > > > > > it > > > > > > > >>> to > > > > > > > >>>>>>>> each application to develop its own bespoke expression > > > > > > > >>>>> representations > > > > > > > >>>>>>>> for its needs. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> I spent a while thinking about this and wrote up a > brain > > > > dump > > > > > RFC > > > > > > > >>>>>>>> document [1] and accompanying pull request [2] that > > makes > > > > the > > > > > > > >>>>> possibly > > > > > > > >>>>>>>> controversial choice of using Flatbuffers to represent > > the > > > > > > > >>>> serialized > > > > > > > >>>>>>>> IR. I discuss the rationale for the choice of > > Flatbuffers > > > in > > > > > the > > > > > > > >>>> RFC > > > > > > > >>>>>>>> document. This PR is obviously deficient in many > regards > > > > > > > >>>> (incomplete, > > > > > > > >>>>>>>> hacky, or unclear in places), and will need some help > > from > > > > > others > > > > > > > >>>> to > > > > > > > >>>>>>>> flesh out. I suspect that actually implementing the IR > > > will > > > > be > > > > > > > >>>>>>>> necessary to work out many of the low-level details. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Note that this IR is intended to be more of a > "superset" > > > > > project > > > > > > > >>>> than > > > > > > > >>>>>>>> a "lowest common denominator". So there may be things > > that > > > > it > > > > > > > >>>>> includes > > > > > > > >>>>>>>> which are only available in some engines (e.g. engines > > > that > > > > > have > > > > > > > >>>>>>>> specialized handling of time series data). > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> As some of my personal motivation for the project, > > > > concurrent > > > > > > > >>> with > > > > > > > >>>>> the > > > > > > > >>>>>>>> genesis of Apache Arrow, I started a Python project > > called > > > > > Ibis > > > > > > > >>> [3] > > > > > > > >>>>>>>> (which is similar to R's dplyr project) which serves > as > > a > > > > > "Python > > > > > > > >>>>>>>> analytical query IR builder" that is capable of > > generating > > > > > most > > > > > > > >>> of > > > > > > > >>>>> the > > > > > > > >>>>>>>> SQL standard, targeting many different SQL dialects > and > > > > other > > > > > > > >>>>> backends > > > > > > > >>>>>>>> (like pandas). Microsoft ([4]) and Google ([5]) have > > used > > > > this > > > > > > > >>>>> library > > > > > > > >>>>>>>> as a "many-SQL" middleware. As such, I would like to > be > > > able > > > > > to > > > > > > > >>>>>>>> translate between the in-memory "logical query" data > > > > > structures > > > > > > > >>> in > > > > > > > >>>> a > > > > > > > >>>>>>>> library like Ibis to a serialized format that can be > > > > executed > > > > > by > > > > > > > >>>> many > > > > > > > >>>>>>>> different Arrow-native query engines. The expression > > > > > primitives > > > > > > > >>>>>>>> available in Ibis should serve as helpful test cases, > > too. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> I look forward to the community's comments on the RFC > > > > document > > > > > > > >>> [1] > > > > > > > >>>>> and > > > > > > > >>>>>>>> pull request [2] -- I realize that arriving at > consensus > > > on > > > > a > > > > > > > >>>> complex > > > > > > > >>>>>>>> and ambitious project like this can be challenging so > I > > > > > recommend > > > > > > > >>>>>>>> spending time on the "non-goals" section in the RFC > and > > > ask > > > > > > > >>>> questions > > > > > > > >>>>>>>> if you are unclear about the scope of what problems > this > > > is > > > > > > > >>> trying > > > > > > > >>>> to > > > > > > > >>>>>>>> solve. I would be happy to give Edit access on the RFC > > > > > document > > > > > > > >>> to > > > > > > > >>>>>>>> others and would consider ideas about how to move > > forward > > > > with > > > > > > > >>>>>>>> something that is able to be implemented by different > > > Arrow > > > > > > > >>>> libraries > > > > > > > >>>>>>>> in the reasonably near future. > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> Thanks, > > > > > > > >>>>>>>> Wes > > > > > > > >>>>>>>> > > > > > > > >>>>>>>> [1]: > > > > > > > >>>>>>> > > > > > > > >>>>> > > > > > > > >>>> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit# > > > > > > > >>>>>>>> [2]: https://github.com/apache/arrow/pull/10856 > > > > > > > >>>>>>>> [3]: https://ibis-project.org/ > > > > > > > >>>>>>>> [4]: > > > http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf > > > > > > > >>>>>>>> [5]: > > > > > > > >>>>>>> > > > > > > > >>>>> > > > > > > > >>>> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > https://cloud.google.com/blog/products/databases/automate-data-validation-with-dvt > > > > > > > >>>>>>> > > > > > > > >>>>>>> > > > > > > > >>>>> > > > > > > > >>>> > > > > > > > >>> > > > > > > > > > > > > > > > > > > > > > > > > > > > > >