bkietz commented on a change in pull request #10934: URL: https://github.com/apache/arrow/pull/10934#discussion_r690584137
########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. +table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); +} + +/// An expression is one of +/// - a Literal datum +/// - a reference to a Field from a Relation +/// - a call to a named function +/// On evaluation, an Expression will have either array or scalar shape. +union ExpressionImpl { + Literal, FieldRef, Call +} + +table Expression { + impl: ExpressionImpl (required); Review comment: The generators for some languages don't support vector-of-unions. I'll add a comment explaining that ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. +table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); +} + +/// An expression is one of +/// - a Literal datum +/// - a reference to a Field from a Relation +/// - a call to a named function +/// On evaluation, an Expression will have either array or scalar shape. +union ExpressionImpl { + Literal, FieldRef, Call +} + +table Expression { + impl: ExpressionImpl (required); +} + +union Shape { + Array, Scalar +} + +table Scalar {} + +table Array { + /// Number of slots. + length: long; +} + +table Literal { + /// Shape of this literal. + /// + /// Note that this is orthogonal to type and refers to the number + /// of rows spanned by this Literal - a Literal may be Scalar shaped + /// with multiple "columns" if the type happens to be Struct. + shape: Shape (required); + + /// The type of this literal. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. + field: Field (required); + + /// Buffers containing N elements of arrow-formatted data, where N + /// is Array.length if shape is Array or 1 if shape is Scalar. 
+ /// XXX this can be optimized for trivial scalars later + buffers: [InlineBuffer]; + + /// If (and only if) this Literal has dictionary type, this field dictionary + /// into which the literal's indices refer. + dictionary: Literal; +} + +table FieldRef { + /// A sequence of field names to allow referencing potentially nested fields + path: [string]; + + /// For Expressions which might reference fields in multiple Relations, + /// this index may be provided to indicate which Relation's fields + /// `path` points into. For example in the case of a join, + /// 0 refers to the left relation and 1 to the right relation. + relation_index: int; + + /// The type of the referenced Field. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. + field: Field; +} + +table Call { + /// The namespaced name of the function whose invocation this Call represents. + /// For example: "arrow::add" or "gandiva::jit_3432". + /// + /// Names with no namespace are reserved for canonicalization. + function_name: string (required); + + /// Parameters for `function_name`; content/format may be unique to each + /// value of `function_name`. + options: InlineBuffer; + + /// The arguments passed to `function_name`. + arguments: [Expression] (required); + + /// The type of data which invoking `function_name` will return. + /// Field is used instead of Type to pick up child fields, + /// dictionary encoding, etc. + field: Field; +} + +/// A relation is a set of rows with consistent schema. +table Relation { + /// The namespaced name of this Relation. + /// For example: "arrow::hash_join" or "gandiva::filter_and_project". 
+ /// + /// Names with no namespace are reserved for canonical, "pure" relational + /// algebraic operations, which currently include: + /// "filter" + /// "project" + /// "aggregate" + /// "join" + /// "order_by" + /// "limit" + /// "common" + /// "union" + /// "literal" + /// "interactive_output" + relation_name: string (required); + + /// Parameters for `relation_name`; content/format may be unique to each + /// value of `relation_name`. + options: InlineBuffer; + + /// The arguments passed to `relation_name`. + arguments: [Relation] (required); + + /// The schema of rows in this Relation + schema: Schema; +} + +/// The contents of Relation.options will be FilterOptions +/// if Relation.relation_name = "filter" +table FilterOptions { + /// The expression which will be evaluated against input rows + /// to determine whether they should be excluded from the + /// "filter" relation's output. + filter_expression: Expression (required); +} + +/// The contents of Relation.options will be ProjectOptions +/// if Relation.relation_name = "project" +table ProjectOptions { + /// Expressions which will be evaluated to produce to + /// the rows of the "project" relation's output. + expressions: [Expression] (required); +} + +/// The contents of Relation.options will be AggregateOptions +/// if Relation.relation_name = "aggregate" +table AggregateOptions { + /// Expressions which will be evaluated to produce to + /// the rows of the "aggregate" relation's output. + aggregations: [Expression] (required); + /// Keys by which `aggregations` will be grouped. + keys: [Expression] (required); +} + +/// The contents of Relation.options will be JoinOptions +/// if Relation.relation_name = "join" +table JoinOptions { + /// The expression which will be evaluated against rows from each + /// input to determine whether they should be included in the + /// "join" relation's output. + on_expression: Expression (required); + /// The namespaced name of the join to use. 
Non-namespaced names are + /// reserved for canonicalization. Current names include: + /// "inner" + /// "left" + /// "right" + /// "outer" + /// "cross" + join_name: string; +} + +/// Whether lesser values should precede greater or vice versa, +/// also whether nulls should preced or follow values. +enum Ordering : uint8 { + ASCENDING_THEN_NULLS, + DESCENDING_THEN_NULLS, + NULLS_THEN_ASCENDING, + NULLS_THEN_DESCENDING +} + +table SortKey { + value: Expression (required); + ordering: Ordering = ASCENDING_THEN_NULLS; +} + +/// The contents of Relation.options will be OrderByOptions +/// if Relation.relation_name = "order_by" +table OrderByOptions { + /// Define sort order for rows of output. + /// Keys with higher precedence are ordered ahead of other keys. + keys: [SortKey] (required); +} + +/// The contents of Relation.options will be LimitOptions +/// if Relation.relation_name = "limit" +table LimitOptions { + /// Set the maximum number of rows of output. + count: long; +} + +/// The contents of Relation.options will be CommonOptions +/// if Relation.relation_name = "common" +table CommonOptions { + /// Commons (CTEs in SQL) allow assigning a name to a stream Review comment: I'll expand the comment ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. 
See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. +table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); Review comment: I think this can be resolved by taking a leaf out of parquet2's book and defining InlineBuffer as a union of vectors of each primitive type, I'll try that. ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,348 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. 
+table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); +} + +/// An expression is one of +/// - a Literal datum +/// - a reference to a Field from a Relation +/// - a call to a named function +/// On evaluation, an Expression will have either array or scalar shape. +union ExpressionImpl { + Literal, FieldRef, Call +} + +table Expression { + // Ideally we'd simply have `union Expression { Literal, FieldRef, Call }` + // but not all generators support vectors of unions so we provide minimal + // indirection to support them. + impl: ExpressionImpl (required); +} + +union Shape { + Array, Scalar +} + +table Scalar {} + +table Array { + /// Number of slots. + length: long; +} + +table Literal { + /// Shape of this literal. + /// + /// Note that this is orthogonal to type and refers to the number + /// of rows spanned by this Literal - a Literal may be Scalar shaped + /// with multiple "columns" if the type happens to be Struct. + shape: Shape (required); + + /// The type of this literal. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. + field: Field (required); Review comment: SGTM ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,348 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. 
You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. +table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); +} + +/// An expression is one of +/// - a Literal datum +/// - a reference to a Field from a Relation +/// - a call to a named function +/// On evaluation, an Expression will have either array or scalar shape. +union ExpressionImpl { + Literal, FieldRef, Call +} + +table Expression { + // Ideally we'd simply have `union Expression { Literal, FieldRef, Call }` + // but not all generators support vectors of unions so we provide minimal + // indirection to support them. + impl: ExpressionImpl (required); +} + +union Shape { + Array, Scalar +} + +table Scalar {} + +table Array { + /// Number of slots. + length: long; +} + +table Literal { + /// Shape of this literal. + /// + /// Note that this is orthogonal to type and refers to the number + /// of rows spanned by this Literal - a Literal may be Scalar shaped + /// with multiple "columns" if the type happens to be Struct. + shape: Shape (required); + + /// The type of this literal. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. + field: Field (required); + + /// Buffers containing N elements of arrow-formatted data, where N + /// is Array.length if shape is Array or 1 if shape is Scalar. 
+ /// XXX this can be optimized for trivial scalars later + buffers: [InlineBuffer]; + + /// If (and only if) this Literal has dictionary type, this field dictionary + /// into which the literal's indices refer. + dictionary: Literal; +} + +table FieldRef { + /// A sequence of field names to allow referencing potentially nested fields + path: [string]; + + /// For Expressions which might reference fields in multiple Relations, + /// this index may be provided to indicate which Relation's fields + /// `path` points into. For example in the case of a join, + /// 0 refers to the left relation and 1 to the right relation. + relation_index: int; + + /// The type of the referenced Field. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. + field: Field; +} + +/// A canonical (probably SQL equivalent) function +enum CanonicalFunctionId : uint32 { + // logical + And, + Not, + Or, + + // arithmetic + Add, + Subtract, + Multiply, + Divide, + Power, + AbsoluteValue, + Negate, + Sign, + + // comparison + Equal, + NotEqual, + Greater, + GreaterOrEqual, + Less, + LessOrEqual, + + // aggregations + All, + Any, + Count, + Mean, + Min, + Max, + Mode, + Product, + Sum, + Tdigest, + Quantile, + Variance, + StandardDeviation, +} + +table CanonicalFunction { + id: CanonicalFunctionId; +} + +table NonCanonicalFunction { + name_space: string (required); + name: string (required); +} + +union Function { + CanonicalFunction, NonCanonicalFunction Review comment: I'm not sure how that's distinct from providing a NonCanonicalFunction with the serialized function in the options blob ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,348 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. 
The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. +table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); +} + +/// An expression is one of +/// - a Literal datum +/// - a reference to a Field from a Relation +/// - a call to a named function +/// On evaluation, an Expression will have either array or scalar shape. +union ExpressionImpl { + Literal, FieldRef, Call +} + +table Expression { + // Ideally we'd simply have `union Expression { Literal, FieldRef, Call }` + // but not all generators support vectors of unions so we provide minimal + // indirection to support them. + impl: ExpressionImpl (required); +} + +union Shape { + Array, Scalar +} + +table Scalar {} + +table Array { + /// Number of slots. + length: long; +} + +table Literal { + /// Shape of this literal. + /// + /// Note that this is orthogonal to type and refers to the number + /// of rows spanned by this Literal - a Literal may be Scalar shaped + /// with multiple "columns" if the type happens to be Struct. + shape: Shape (required); + + /// The type of this literal. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. 
+ field: Field (required); + + /// Buffers containing N elements of arrow-formatted data, where N + /// is Array.length if shape is Array or 1 if shape is Scalar. + /// XXX this can be optimized for trivial scalars later + buffers: [InlineBuffer]; + + /// If (and only if) this Literal has dictionary type, this field dictionary + /// into which the literal's indices refer. + dictionary: Literal; +} + +table FieldRef { + /// A sequence of field names to allow referencing potentially nested fields + path: [string]; + + /// For Expressions which might reference fields in multiple Relations, + /// this index may be provided to indicate which Relation's fields + /// `path` points into. For example in the case of a join, + /// 0 refers to the left relation and 1 to the right relation. + relation_index: int; + + /// The type of the referenced Field. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. + field: Field; +} + +/// A canonical (probably SQL equivalent) function +enum CanonicalFunctionId : uint32 { + // logical + And, + Not, + Or, + + // arithmetic + Add, + Subtract, + Multiply, + Divide, + Power, + AbsoluteValue, + Negate, + Sign, + + // comparison + Equal, + NotEqual, + Greater, + GreaterOrEqual, + Less, + LessOrEqual, + + // aggregations + All, + Any, + Count, + Mean, + Min, + Max, + Mode, + Product, + Sum, + Tdigest, + Quantile, + Variance, + StandardDeviation, +} + +table CanonicalFunction { + id: CanonicalFunctionId; +} + +table NonCanonicalFunction { + name_space: string (required); + name: string (required); +} + +union Function { + CanonicalFunction, NonCanonicalFunction +} + +table Call { + /// The function whose invocation this Call represents. + function: Function (required); + + /// Parameters for `function_name`; content/format may be unique to each + /// value of `function_name`. + options: InlineBuffer; + + /// The arguments passed to `function_name`. 
+ arguments: [Expression] (required); + + /// The type of data which invoking `function_name` will return. + /// Field is used instead of Type to pick up child fields, + /// dictionary encoding, etc. + field: Field; +} + +enum CanonicalOperationId : uint32 { + Literal, + Filter, + Project, + Aggregate, + Join, + OrderBy, + Limit, + Common, + Union, + InteractiveOutput, +} + +table CanonicalOperation { + id: CanonicalOperationId; +} + +table NonCanonicalOperation { + name_space: string (required); + name: string (required); +} + +union Operation { + CanonicalOperation, NonCanonicalOperation +} + +/// A relation is a set of rows with consistent schema. +table Relation { + /// The operation which this Relation wraps. + operation: Operation (required); + + /// Parameters for `operation`; content/format may be unique to each + /// value of `operation`. + options: InlineBuffer; + + /// The arguments passed to `operation`. + arguments: [Relation] (required); + + /// The schema of rows in this Relation + schema: Schema; +} + +/// The contents of Relation.options will be FilterOptions +/// if Relation.operation = CanonicalOperation::Filter +table FilterOptions { + /// The expression which will be evaluated against input rows + /// to determine whether they should be excluded from the + /// filter relation's output. + filter_expression: Expression (required); +} + +/// The contents of Relation.options will be ProjectOptions +/// if Relation.operation = CanonicalOperation::Project +table ProjectOptions { + /// Expressions which will be evaluated to produce to + /// the rows of the project relation's output. + expressions: [Expression] (required); +} + +/// The contents of Relation.options will be AggregateOptions +/// if Relation.operation = CanonicalOperation::Aggregate +table AggregateOptions { + /// Expressions which will be evaluated to produce to + /// the rows of the aggregate relation's output. 
+ aggregations: [Expression] (required); + /// Keys by which `aggregations` will be grouped. + keys: [Expression] (required); +} + +enum CanonicalJoinKindId : uint32 { + Inner, + LeftOuter, + RightOuter, + FullOuter, + Cross, +} + +table CanonicalJoinKind { + id: CanonicalJoinKindId; +} + +table NonCanonicalJoinKind { + name_space: string (required); + name: string (required); +} + +union JoinKind { + CanonicalJoinKind, NonCanonicalJoinKind +} + +/// The contents of Relation.options will be JoinOptions +/// if Relation.operation = CanonicalOperation::Join +table JoinOptions { + /// The expression which will be evaluated against rows from each + /// input to determine whether they should be included in the + /// join relation's output. + on_expression: Expression (required); + /// The kind of join to use. + join_kind: JoinKind (required); +} + +/// Whether lesser values should precede greater or vice versa, +/// also whether nulls should preced or follow values. +enum Ordering : uint8 { + ASCENDING_THEN_NULLS, + DESCENDING_THEN_NULLS, + NULLS_THEN_ASCENDING, + NULLS_THEN_DESCENDING +} + +table SortKey { + value: Expression (required); + ordering: Ordering = ASCENDING_THEN_NULLS; +} + +/// The contents of Relation.options will be OrderByOptions +/// if Relation.operation = CanonicalOperation::OrderBy +table OrderByOptions { + /// Define sort order for rows of output. + /// Keys with higher precedence are ordered ahead of other keys. + keys: [SortKey] (required); +} + +/// The contents of Relation.options will be LimitOptions +/// if Relation.operation = CanonicalOperation::Limit +table LimitOptions { + /// Set the maximum number of rows of output. + count: long; +} + +/// The contents of Relation.options will be CommonOptions +/// if Relation.operation = CanonicalOperation::Common +table CommonOptions { + /// Commons (CTEs in SQL) allow assigning a name to a stream + /// of data and reusing it, potentially multiple times and + /// potentially recursively. 
+ name: string (required); +} + +/// The contents of Relation.options will be UnionOptions +/// if Relation.operation = CanonicalOperation::Union +table UnionOptions { + /// For simplicity, all rows from any input to a union relation + /// will always be concatenated into a single output- establishing + /// uniqueness of output rows is deferred to other relations. +} + +/// The contents of Relation.options will be LiteralOptions +/// if Relation.operation = CanonicalOperation::Literal +table LiteralOptions { + /// The columns of this literal relation. + columns: [Literal] (required); +} + +/// A specification of a query. +table Plan { + /// One or more output relations. + sinks: [Relation] (required); Review comment: This is a difference in execution model for this proposal (for which `rpc_service Interactive` is provided semi-didactically to clarify). If the root_type is a TableExpr which evaluates to batches, that implies that there is a channel open between the consumer and the producer along which those batches can be returned. This is frequently the case but in general I think we'll want to be able to express execution plans which don't rely on an interactive channel. For example: Plans generated as fragments of larger plans in distributed execution or plans which represent an ETL job. Therefore I think it's preferable that a Plan explicitly include the destination for all batches, even if that will quite commonly be `operation=InteractiveOutput` (just pipe them back to the user). Only a single instance of InteractiveOutput is permitted in a Plan (so no interactive producer will need to deinterleave batches piped back along the interactive channel, which I *think* was your concern here). However any number of other sinks are permitted. 
For example, a Plan may specify that one set of batches be streamed to `tcp://somehost.com:890` for consumption by a service on that host and an unfiltered superset of those batches cached locally into `file://tmp/cache/my_query` for debugging. ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,348 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. +table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); +} + +/// An expression is one of +/// - a Literal datum +/// - a reference to a Field from a Relation +/// - a call to a named function +/// On evaluation, an Expression will have either array or scalar shape. +union ExpressionImpl { + Literal, FieldRef, Call +} + +table Expression { + // Ideally we'd simply have `union Expression { Literal, FieldRef, Call }` + // but not all generators support vectors of unions so we provide minimal + // indirection to support them. 
+ impl: ExpressionImpl (required); +} + +union Shape { + Array, Scalar +} + +table Scalar {} + +table Array { + /// Number of slots. + length: long; +} + +table Literal { + /// Shape of this literal. + /// + /// Note that this is orthogonal to type and refers to the number + /// of rows spanned by this Literal - a Literal may be Scalar shaped + /// with multiple "columns" if the type happens to be Struct. + shape: Shape (required); + + /// The type of this literal. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. + field: Field (required); + + /// Buffers containing N elements of arrow-formatted data, where N + /// is Array.length if shape is Array or 1 if shape is Scalar. + /// XXX this can be optimized for trivial scalars later + buffers: [InlineBuffer]; + + /// If (and only if) this Literal has dictionary type, this field dictionary + /// into which the literal's indices refer. + dictionary: Literal; +} + +table FieldRef { + /// A sequence of field names to allow referencing potentially nested fields + path: [string]; + + /// For Expressions which might reference fields in multiple Relations, + /// this index may be provided to indicate which Relation's fields + /// `path` points into. For example in the case of a join, + /// 0 refers to the left relation and 1 to the right relation. + relation_index: int; + + /// The type of the referenced Field. Field is used instead of Type to pick + /// up child fields, dictionary encoding, etc. 
+ field: Field; +} + +/// A canonical (probably SQL equivalent) function +enum CanonicalFunctionId : uint32 { + // logical + And, + Not, + Or, + + // arithmetic + Add, + Subtract, + Multiply, + Divide, + Power, + AbsoluteValue, + Negate, + Sign, + + // comparison + Equal, + NotEqual, + Greater, + GreaterOrEqual, + Less, + LessOrEqual, + + // aggregations + All, + Any, + Count, + Mean, + Min, + Max, + Mode, + Product, + Sum, + Tdigest, + Quantile, + Variance, + StandardDeviation, +} + +table CanonicalFunction { + id: CanonicalFunctionId; +} + +table NonCanonicalFunction { + name_space: string (required); + name: string (required); +} + +union Function { + CanonicalFunction, NonCanonicalFunction +} + +table Call { + /// The function whose invocation this Call represents. + function: Function (required); + + /// Parameters for `function_name`; content/format may be unique to each + /// value of `function_name`. + options: InlineBuffer; + + /// The arguments passed to `function_name`. + arguments: [Expression] (required); + + /// The type of data which invoking `function_name` will return. + /// Field is used instead of Type to pick up child fields, + /// dictionary encoding, etc. + field: Field; +} + +enum CanonicalOperationId : uint32 { + Literal, + Filter, + Project, + Aggregate, + Join, + OrderBy, + Limit, + Common, + Union, + InteractiveOutput, +} + +table CanonicalOperation { + id: CanonicalOperationId; +} + +table NonCanonicalOperation { + name_space: string (required); + name: string (required); +} + +union Operation { + CanonicalOperation, NonCanonicalOperation +} + +/// A relation is a set of rows with consistent schema. +table Relation { + /// The operation which this Relation wraps. + operation: Operation (required); + + /// Parameters for `operation`; content/format may be unique to each + /// value of `operation`. + options: InlineBuffer; Review comment: -1. 
This adds yet more special casing for the canonical operations and makes it harder to write generic pattern matching utilities while also giving us another discriminant to query and validate. ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. +table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); +} + +/// An expression is one of +/// - a Literal datum +/// - a reference to a Field from a Relation +/// - a call to a named function +/// On evaluation, an Expression will have either array or scalar shape. +union ExpressionImpl { + Literal, FieldRef, Call +} + +table Expression { + impl: ExpressionImpl (required); Review comment: I agree that every Expression needs type, but I disagree that name is meaningful for Expressions. 
To my mind a name is ascribed to the referent by the referring entity; in graph-theory terms, it is an edge property rather than a node property. To give a practical context: in a projection like `SELECT $complicated_expr as A, $complicated_expr as B` it should not be an error to use a single memoized instance of `$complicated_expr`, but including name as an expression property would make these two incompatible. ########## File path: format/ComputeIR.fbs ########## @@ -0,0 +1,267 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + +include "Schema.fbs"; + +namespace org.apache.arrow.flatbuf.computeir; + +/// Avoid use of org.apache.arrow.Buffer because it requires a +/// sidecar block of bytes. 
+table InlineBuffer { + // ulong is used to guarantee alignment and padding of `bytes` so that flatbuffers + // and other alignment sensitive blobs can be stored here + bytes: [ulong] (required); Review comment: I've tried an approach inspired by `arrow2`'s decision to include the primitive type being stored in buffers (and thus also alignment information) all the way down to `struct Bytes` https://github.com/jorgecarleitao/arrow2/blob/main/src/buffer/bytes.rs#L39 I *think* this guarantees alignment without requiring padding fields or reinterpret_casts
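To illustrate the alignment remark on `InlineBuffer`, here is a minimal Rust sketch of the general idea: if bytes are stored in a container whose element type is `u64`, the allocation is aligned for `u64` by construction. This is not the PR's code and not `arrow2`'s `Bytes`; the struct layout, the `len` field, and all method names are hypothetical and exist only for illustration.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for the schema's `InlineBuffer`: the payload lives
/// in `u64` words, so the element type itself carries the alignment.
pub struct InlineBuffer {
    words: Vec<u64>,
    /// Number of meaningful bytes (assumption: the schema as quoted has no
    /// such field; it is added here so trailing padding can be ignored).
    len: usize,
}

impl InlineBuffer {
    /// Copy `bytes` into word storage, zero-padding the final word.
    pub fn from_bytes(bytes: &[u8]) -> Self {
        let n_words = (bytes.len() + 7) / 8;
        let mut words = vec![0u64; n_words];
        for (i, chunk) in bytes.chunks(8).enumerate() {
            let mut word = [0u8; 8];
            word[..chunk.len()].copy_from_slice(chunk);
            // Native byte order, so viewing the words as bytes round-trips.
            words[i] = u64::from_ne_bytes(word);
        }
        InlineBuffer { words, len: bytes.len() }
    }

    /// View the payload as bytes, excluding the zero padding.
    pub fn as_bytes(&self) -> &[u8] {
        // SAFETY: `u8` has alignment 1 and every bit pattern is valid; the
        // range covers exactly the initialized elements of `self.words`.
        let all = unsafe {
            std::slice::from_raw_parts(
                self.words.as_ptr() as *const u8,
                self.words.len() * size_of::<u64>(),
            )
        };
        &all[..self.len]
    }

    /// True if the storage start is aligned for `u64`.
    pub fn is_aligned_for_u64(&self) -> bool {
        self.words.as_ptr() as usize % align_of::<u64>() == 0
    }
}
```

Calling `InlineBuffer::from_bytes` on a 10-byte slice allocates two words, and `as_bytes` returns the original 10 bytes. The one pointer cast is confined to `as_bytes`, the direction that is always sound because it only lowers the alignment requirement; reinterpreting arbitrary bytes as a wider type would need a runtime alignment check.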
