Re: Substrait, a new project focused on serialized algebra
Hey Andrew,

Sorry for the late reply. For some reason, your email never came through to me and I didn't know about it until someone just referenced it in the Apache mail archive. I'll respond to your questions inline. In general, we're experimenting with the GitHub Discussions functionality as opposed to using a mailing list [1], so it would be great if you also continued the discussion there.

> This seems like something that would be beneficial if you can get other
> projects to buy into it. [0]

Completely agree. This project is designed specifically to help other projects work together and will only be impactful if it can be leveraged by several projects. I think we've made good progress thus far with contributions from several key communities (Arrow C++, Datafusion, Iceberg, Presto/Velox, Singlestore and others).

The xkcd reference is rough. I agree with the general sentiment, but I'm not aware of any initiative that is actually trying to do this cross-project (there are initiatives within a single project, but they take a very "local" perspective). To me, the opportunity is about solving for extensibility and a library of common conversions between varying standards. This is much like the LLVM project. There were many people building languages and compilers before LLVM. By building a standardized intermediate representation, LLVM enabled frontends to be built independently of the backend targets. Hopefully the same will be true for processing systems with Substrait.

> How did you agree to the four indicator implementations?

They were a proposed sketch. We've had further discussion on that topic in this discussion thread [2]. Keep in mind that those projects are not fundamental to Substrait; they were just used to help guide an initial attempt at coming up with a system-agnostic set of common modern data types.

> Are those projects committed to making breaking changes to move closer
> together?

No, and Substrait doesn't strive to achieve that.
The goal of Substrait is to come up with a framework, a set of common primitives, and a strong extensibility system so that systems with specialized needs can still work with the Substrait format and specification and cross-operate with other systems.

> We are working on something similar in Apache Beam. Our goal is a
> unified model with high level APIs that allow users to switch out
> different engines ("runners"), our implementation could be simplified
> by a standard like yours. We take plans as SQL from Apache Calcite and
> Google ZetaSQL, via a programmatic API, and hopefully other sources in
> the future (Dataframes). We execute them on Beam runners (currently as
> a Java implementation, possibly Arrow in the future). Eventually we
> want these plans to run natively where supported (Flink, Spark or
> Google Dataflow). Previously we've focused on the top end (turning
> plans into Beam Java pipelines) but we are working to push the
> relational metadata down the stack now. We need something that is a
> superset of existing implementations,

This is exactly why I think there is a need for Substrait. I've talked to people at nearly a dozen different companies in the last few months, and almost all of them are struggling with these issues in one form or another. People are starting to think about frontend data processing plan producers and backend data processing engines independently, in a more mature manner. Without a standardized way to connect the two, we get a lot of independent engines that can't work together (and a lot of dialect variations that can only target a single engine).

> you appear more focused on a subset.

This is a misinterpretation of the goals of Substrait. While it is true that Substrait will only have a subset of all possible data types, function signatures and relational operations in the project proper, the goal of the extension system is that any project can use additional git-URI-based extensions to describe the additional functionality.
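To make that concrete, here is a purely hypothetical sketch of how a serialized plan could point at functionality outside the core spec via a git URI. Every field name, the repository URL, and the function name below are invented for illustration; this is not the actual Substrait schema.

```python
# Hypothetical sketch of git-URI-based extensions in a serialized plan.
# All names here are invented for illustration, not the real schema.
plan = {
    # Extensions are declared once and given a local anchor...
    "extension_uris": [
        {"anchor": 1,
         "uri": "https://github.com/example-org/engine-extensions"},  # invented repo
    ],
    # ...and functions defined by those extensions are bound to anchors
    # that expressions elsewhere in the plan can reference.
    "extension_functions": [
        {"function_anchor": 7,
         "extension_uri_reference": 1,
         "name": "my_special_agg"},  # invented function name
    ],
}

# A consumer that recognizes the extension can execute the function;
# one that doesn't can reject the plan cleanly instead of guessing.
known_uris = {"https://github.com/example-org/engine-extensions"}
declared = {e["uri"] for e in plan["extension_uris"]}
can_execute = declared <= known_uris
```

The point of the sketch is the indirection: the core spec only has to define how extensions are referenced, not what they contain.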
> It could still work but Google's funding to build Relational
> Beam is dependent on us providing support for internal use cases,
> which means native execution on Dataflow with ZetaSQL. Apache Beam
> isn't tied to ZetaSQL but we won't be able to adopt a standard that
> prevents us from passing the ZetaSQL tests.

> The functions are significantly more difficult than the types. Inside
> Google, ZetaSQL is the standard and that team has defined a huge set
> of test cases[4] for their functions. It took them several years to
> get a catalog built and for everyone to standardize on it. I'm not
> sure the same thing is possible outside a company with a top down
> mandate to unify. My understanding is that they based their function
> catalog on the existing implementations that were being unified and
> used external references (SQL standard, Postgres) to resolve
> conflicts. They have a reference implementation that they consider the
> source of truth. This specification is
Re: Substrait, a new project focused on serialized algebra
It looks like this project is just you at the moment? I don't see any pull requests or a mailing list. (I'm not on Slack.)

This seems like something that would be beneficial if you can get other projects to buy into it. [0] How did you agree to the four indicator implementations? Are those projects committed to making breaking changes to move closer together?

We are working on something similar in Apache Beam. Our goal is a unified model with high level APIs that allow users to switch out different engines ("runners"), our implementation could be simplified by a standard like yours. We take plans as SQL from Apache Calcite and Google ZetaSQL, via a programmatic API, and hopefully other sources in the future (Dataframes). We execute them on Beam runners (currently as a Java implementation, possibly Arrow in the future). Eventually we want these plans to run natively where supported (Flink, Spark or Google Dataflow). Previously we've focused on the top end (turning plans into Beam Java pipelines) but we are working to push the relational metadata down the stack now. We need something that is a superset of existing implementations, you appear more focused on a subset. It could still work but Google's funding to build Relational Beam is dependent on us providing support for internal use cases, which means native execution on Dataflow with ZetaSQL. Apache Beam isn't tied to ZetaSQL but we won't be able to adopt a standard that prevents us from passing the ZetaSQL tests.

Our progression is very similar to your components list. We've already prototyped most of the functionality inside our SQL codebase. We are now pushing it into Beam core. The type system is mostly complete (Beam Schemas[1]), and we are now adding support for Field References[2]. Our pain points with the type system have been around the precision of DECIMAL, FLOAT, DOUBLE, and time types.
Different implementations support different levels of precision, which results in unexpected behavior in the edge cases. For the numeric types, we've mostly defined those as unlimited precision and added "logical types" (what you call "User Defined Types") with restricted precision. For time types, we've been pushing those into "User Defined Types". We have a catalog of built-in "User Defined Types"[3]. We've effectively decided to provide a superset of building block types.

The functions are significantly more difficult than the types. Inside Google, ZetaSQL is the standard and that team has defined a huge set of test cases[4] for their functions. It took them several years to get a catalog built and for everyone to standardize on it. I'm not sure the same thing is possible outside a company with a top down mandate to unify. My understanding is that they based their function catalog on the existing implementations that were being unified and used external references (SQL standard, Postgres) to resolve conflicts. They have a reference implementation that they consider the source of truth. This specification is frequently incompatible with Calcite, particularly around edge cases. Most engines had to make breaking changes to adopt this standard (for a public example, see BigQuery legacy SQL and standard SQL).

Have you put any thought into how you plan to define the function catalog? What about validating an implementation's adherence to the standard? How will you handle minor incompatibilities without effectively having a function for each dialect? (TRIM_SPARK, TRIM_TRINO, TRIM_ARROW, TRIM_ICEBERG...) What about functions that aren't in your chosen dialects?
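The precision pain point is easy to demonstrate. Here is a minimal sketch using Python's `decimal` contexts as stand-ins for two hypothetical engines with different DECIMAL precision limits (the engines and their digit limits are invented for illustration):

```python
from decimal import Decimal, localcontext

def divide(num: str, den: str, precision: int) -> Decimal:
    """Evaluate num/den under an engine-specific decimal precision."""
    with localcontext() as ctx:
        ctx.prec = precision
        return Decimal(num) / Decimal(den)

# The same logical expression, evaluated by two hypothetical engines:
engine_a = divide("1", "3", 38)  # an engine supporting 38 digits
engine_b = divide("1", "3", 18)  # an engine supporting 18 digits

# Equal "in spirit", but not equal as values -- exactly the kind of
# edge-case divergence that pushes restricted-precision behavior into
# logical types layered over an unlimited-precision building block.
assert engine_a != engine_b
```

Any cross-engine plan format has to decide whether results like these are the same type, which is what makes "unlimited precision plus restricted logical types" attractive.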
Andrew

[0] https://xkcd.com/927/
[1] https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java#L419
[2] https://docs.google.com/document/d/1eHSO3aIsAUmiVtfDL-pEFenNBRKt26KkF4dQloMhpBQ/edit
[3] https://github.com/apache/beam/tree/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/logicaltypes
[4] https://github.com/google/zetasql/tree/master/zetasql/compliance

On Wed, Sep 8, 2021 at 8:31 AM Jacques Nadeau wrote:
>
> Hey all,
>
> For some time I've been thinking that having a common serialized
> representation of query plans would be helpful across multiple related
> projects. I started working on something independently in this space
> several months ago. Since then, Arrow started exploring "Arrow IR" and
> Iceberg was proposing something similar to support a cross-engine
> structured view. Given the different veins of interest, I think we should
> combine forces on a consolidated consensus-driven solution.
>
> As I've had more conversations with different people, I've come to the
> conclusion that given the complexity of the task and people's
> competing priorities, a separate "Switzerland project" is the best way to
> find common ground. As such, I've started to sketch out a specification [1]
> called Substrait. One of my key goals with this effort is to
Substrait, a new project focused on serialized algebra
Hey all,

For some time I've been thinking that having a common serialized representation of query plans would be helpful across multiple related projects. I started working on something independently in this space several months ago. Since then, Arrow started exploring "Arrow IR" and Iceberg was proposing something similar to support a cross-engine structured view. Given the different veins of interest, I think we should combine forces on a consolidated consensus-driven solution.

As I've had more conversations with different people, I've come to the conclusion that given the complexity of the task and people's competing priorities, a separate "Switzerland project" is the best way to find common ground. As such, I've started to sketch out a specification [1] called Substrait. One of my key goals with this effort is to expose Calcite functionality to more users and expose alternative ways to encapsulate Calcite functionality as a microservice or series of microservices. For those that are interested, please join the Substrait Slack.

My first goal is to come to a consensus on the type system of simple [2], compound [3] and physical [4] types. The general approach I'm proposing:

- Use Spark, Trino, Arrow and Iceberg as the four indicators of whether something should be part of the spec. It must exist in at least two systems to be formalized.
- Avoid a formal distinction between logical and physical (types, operators, etc.)
- Lean more towards simple types than compound types when systems generally use only a constrained set of parameters (e.g. timestamp(3) and timestamp(6) as opposed to timestamp(x)).
- Provide substantial structured extensibility (avoid black boxes as much as possible).

Links for Substrait:
Site: https://substrait.io
Spec source: https://github.com/substrait-io/substrait/tree/main/site/docs
Binary format: https://github.com/substrait-io/substrait/tree/main/binary

Would love to hear your thoughts!
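As a sketch of the "lean towards simple types" idea (this is my own illustration, not anything in the spec): commonly used parameterizations are enumerated as distinct simple types, while genuinely parameterized types stay compound. All class and type names below are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimpleType:
    name: str  # no parameters; identity is just the name

@dataclass(frozen=True)
class CompoundType:
    name: str
    params: tuple  # parameter names, e.g. ("precision", "scale")

# timestamp(3) and timestamp(6) become two distinct simple types,
# rather than one parameterized timestamp(x):
TIMESTAMP_MILLIS = SimpleType("timestamp_millis")
TIMESTAMP_MICROS = SimpleType("timestamp_micros")

# decimal stays compound because systems use the full parameter range:
DECIMAL = CompoundType("decimal", ("precision", "scale"))
```

The payoff is that consumers can match the enumerated timestamp types exactly instead of validating an arbitrary precision parameter.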
Jacques

[1] https://substrait.io/spec/specification/#components
[2] https://substrait.io/types/simple_logical_types/
[3] https://substrait.io/types/compound_logical_types/
[4] https://substrait.io/types/physical_types/