> It'd be more precise to say that any relation that communicates a table to > something that isn't another Substrait relation should have an optional field > to specify the column names.
I think this can be automated, at least in most cases, assuming the (Substrait-consumer-side) reproduced column-names are natural/standardized. For example, in the cast_table example, the user would not need to specify "value1" and "value2" as output field-names for cast_table because these natural names would be the default. Yaron. ________________________________ From: Jeroen van Straten <jeroen.van.stra...@gmail.com> Sent: Wednesday, April 20, 2022 11:54 AM To: dev@arrow.apache.org <dev@arrow.apache.org> Subject: Re: [C++] output field names in Arrow Substrait > > The names at the relation root are a different story, since they specify > > the column names for the produced table, but there's already a mechanism > > for specifying those. > I feel this is a bit limited because I don't think relation roots are the > only thing that produces tables. Ah, good point. Substrait doesn't have intermediate write relations yet (at least not in protobuf), so I didn't think of that. It'd be more precise to say that any relation that communicates a table to something that isn't another Substrait relation should have an optional field to specify the column names. On Wed, 20 Apr 2022 at 17:42, Li Jin <ice.xell...@gmail.com> wrote: > Sorry for the many typos :( > > > That's why I think computing schema during serialization might be a > better solution than going for index based in Arrow. > > To clarify, by this I meant keeping the substrait representation and in > Arrow consumer, compute/construct schema/column names from subtrait field > indices. > > On Wed, Apr 20, 2022 at 11:25 AM Li Jin <ice.xell...@gmail.com> wrote: > > > > I don't think Substratit cannot convince Spark to go index based. > > Sorry I meant " I don't think Substrait can convince Spark to go index > > based" > > > > On Wed, Apr 20, 2022 at 11:24 AM Li Jin <ice.xell...@gmail.com> wrote: > > > >> > A third alternative might be to adjust Arrow to move to field-index > >> based > >> execution instead of storing schemas in its relational operators. > >> > >> This is an interesting idea - I do think there are operators that need > >> column name information (e.g. a "checkpoint" node that writes to > external > >> storage). In this case, it sounds like any operator that requires > knowing > >> the column name needs to store additional information for the column > names > >> in addition to the field index, in its substrait representation. > >> > >> > The names at the relation root are a different story, since they > >> specify the column names for the produced table, but there's already a > >> mechanism for specifying those. > >> > >> I feel this is a bit limited because I don't think relation roots are > the > >> only thing that produces tables. I can think of two examples in Spark > that > >> produce tables and are not a relation root. (1) A checkpoint node that > >> produces a table to write to external storage (2) A Python UDF node that > >> materializes a slice of the table and produces a pd.DataFrame to pass > down > >> to a user function. I think we will see these cases in Arrow as well. I > >> feel it might make sense to have a subclass of the operator that > requires > >> column names in addition to just root rel. > >> > >> *On schema-based vs index-based**:* > >> > >> From personal experience working on Spark it would say that having > schema > >> available in each operator is useful for developers - I often print > >> input/output rows and I feel not having the column names available might > >> put a tax on the developer. (But maybe there are other ways to solve > this). > >> > >> Are there other substrait consumers at the moment and how do they deal > >> with this? For example, if we were to have Spark consumes substrait I'd > >> imagine there would be similar problems because schema is part of Spark > >> operators too and I don't think Substratit cannot convince Spark to go > >> index based. > >> I'd imagine during deserialization of the subtrait plan to Spark it > would > >> need to compute a schema object for each operator. That's why I think > >> computing schema during serialization might be a better solution than > going > >> for index based in Arrow. > >> > >> > >> On Wed, Apr 20, 2022 at 11:14 AM Phillip Cloud <cpcl...@gmail.com> > wrote: > >> > >>> On Wed, Apr 20, 2022, 11:31 Jeroen van Straten < > >>> jeroen.van.stra...@gmail.com> > >>> wrote: > >>> > >>> > From a Substrait perspective, it would be up to Ibis to convert the > >>> > column names to the correct indices. This can and should be > transparent > >>> > to the end user; front-ends already need to know what the schemas are > >>> > in order to produce a valid Substrait plan, so Ibis should have all > the > >>> > information it needs to do this conversion. > >>> > > >>> > I would, however, agree that it's inconsistent for Substrait to > require > >>> > that column names are specified for read relations, but not for the > >>> > other relations. AFAICT nothing on the consumer side can or should > make > >>> > use of those names. The only use case I can think of is to aid > >>> > debugging; if that's the only reason they exist, it should indeed be > >>> > possible to specify the names for each relation, but it should be in > >>> > the form of as optional, non-functional hints in RelCommon. The names > >>> > at the relation root are a different story, since they specify the > >>> > column names for the produced table, but there's already a mechanism > >>> > for specifying those. No idea if Arrow and Ibis use it (since it's > >>> > optional), but Substrait supports it. > >>> > > >>> > >>> ReadRels having column names is probably not strictly necessary for all > >>> the > >>> ReadRels, but for NamedTable it's necessary. For example, if I wanted > to > >>> represent a ReadRel independent of storage, that an API could > deserialize > >>> for metadata, and present that metadata to the user then you need > column > >>> names. Ibis uses the column names from ReadRels to turn Substrait back > >>> into > >>> expressions. > >>> > >>> The RelRoot is very much used in ibis. Every table expression compiles > >>> to a > >>> RelRoot. > >>> > >>> > >>> > If there *are* special cases I'm not seeing, where the execution > engine > >>> > really does need the names of intermediate columns, I suppose it > would > >>> > be up to the engine to figure out what they would be. The information > >>> > could also be added via Substrait "advanced extensions," especially > >>> > when a user overrides a name, but it wouldn't be a good idea for the > >>> > consumer to rely on the existance of these extensions, for > >>> > compatibility with other frameworks. > >>> > >>> > >>> > If said special cases turn out to not be so special, a valid case > could > >>> > be made to pull that extension into Substrait itself. If they're only > >>> > attached wherever the user manually overrides a name, I think the > size > >>> > explosion Phillip is talking about would be avoided. I *really* don't > >>> > see why anyone would want to match *generated* names in any > functional > >>> > way; that's a recipe for undefined behavior. > >>> > > >>> > None of Substrait's built-in things make use of column names though, > >>> > which implies that any Arrow execution node that *does* need it > cannot > >>> > be described with Substrait, and thus should never appear in an > >>> > execution plan deserialized from Substrait. If it currently does, > >>> > that's an issue with Arrow's Substrait consumer. If we'd want to > >>> > correctly represent those kinds of nodes anyway, we'd already be > >>> > dealing with an extension relation, to which we could attach whatever > >>> > information we want. > >>> > > >>> > On Wed, 20 Apr 2022 at 16:05, Phillip Cloud <cpcl...@gmail.com> > wrote: > >>> > > >>> > > A third alternative might be to adjust Arrow to move to field-index > >>> based > >>> > > execution instead of storing schemas in its relational operators. > >>> > > > >>> > > I suspect that deserializing the schema on the arrow side is the > >>> right > >>> > > solution in at least the near term. Repeating field names where > they > >>> > aren't > >>> > > strictly necessary has a tendency to enlarge plans, and the effect > >>> gets > >>> > > worse as plans go through transformations. > >>> > > > >>> > > On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <rt...@hotmail.com> > >>> wrote: > >>> > > > >>> > > > Hi Weston, > >>> > > > > >>> > > > To reiterate, the scenario is not a user writing a Substrait plan > >>> by > >>> > > hand, > >>> > > > but a user writing an Ibis expression and the remaining steps - > >>> > Substrait > >>> > > > compilation, serialization to protobuf, deserialization to an > Arrow > >>> > plan, > >>> > > > and execution of the plan - are done systematically. This issue > is > >>> > > related > >>> > > > to Ibis, Ibis-Substrait, and Arrow together, so multiple persons > >>> may > >>> > need > >>> > > > to bear on this issue. > >>> > > > Index based field references should still work here. For > example, > >>> if > >>> > > > I have tables: > >>> > > > > >>> > > > Key | LeftPayload -- Key | RightPayload > >>> > > > > >>> > > > The join expression would be field(0) == field(2) > >>> > > > Right, but the point in this use case is that it would not be > >>> > convenient > >>> > > > for a user writing an Ibis expression to specify each key of each > >>> > table, > >>> > > > especially when plenty of (say, 100) tables are being joined. It > >>> would > >>> > be > >>> > > > much more convenient for the user to specify once the name of the > >>> key > >>> > > that > >>> > > > exists in all tables. This is a use case where the key should > (not > >>> > must) > >>> > > be > >>> > > > specified by name for convenience. > >>> > > > In both of those cases the field names are not part of the plan > >>> itself. > >>> > > > In the second use case, the user writing the ibis expression is > >>> > > specifying > >>> > > > a string-name as a parameter to a relation that would later get > >>> passed > >>> > to > >>> > > > an Arrow execution node in an options instance and used to > >>> dynamically > >>> > > set > >>> > > > up field-names. This string-name does in fact appear explicitly > in > >>> the > >>> > > > Substrait plan, not for convenience but of necessity. What does > not > >>> > > appear > >>> > > > in the Substrait plan are output field-names of intermediate > >>> relations; > >>> > > > this in itself is not a problem, because these field-names can > (in > >>> > > > principle) be recomputed to allow them to be matched to the > >>> dynamically > >>> > > > set-up field-names. But the current implementation of Arrow (in > its > >>> > > > Substrait module) does not do this because the schemata are not > >>> > available > >>> > > > during deserialization, when the option instances (such as > >>> > > > ProjectNodeOptions) that require the schemata are created. > Instead, > >>> > Arrow > >>> > > > (in its Substrait module) ends up with non-natural field-names > like > >>> > > > "FieldPath(1)" that fail to be matched. > >>> > > > > >>> > > > That's why this is not a Substrait specific problem, but one that > >>> is > >>> > > > related to Ibis, Ibis-Substrait, and Arrow together. We think it > >>> can be > >>> > > > resolved either by changing Substrait to assist Arrow by > >>> specifying the > >>> > > > field-names of intermediate relations in the plan, so that they > are > >>> > > readily > >>> > > > available during deserialization, or by changing Arrow (in its > >>> > Substrait > >>> > > > module) to reproduce the field-names during deserialization, like > >>> by > >>> > > > computing the schemata before options and node instances are > >>> created. > >>> > We > >>> > > > might come up with other solutions in this discussion, of course. > >>> > Either > >>> > > > way, for this second use case, the field-names of intermediate > >>> > relations > >>> > > > must be natural; they cannot be left as something like > >>> "FieldPath(1)". > >>> > > > > >>> > > > > >>> > > > Yaron. > >>> > > > ________________________________ > >>> > > > From: Weston Pace <weston.p...@gmail.com> > >>> > > > Sent: Tuesday, April 19, 2022 7:12 PM > >>> > > > To: dev@arrow.apache.org <dev@arrow.apache.org> > >>> > > > Cc: Li Jin <ice.xell...@gmail.com> > >>> > > > Subject: Re: [C++] output field names in Arrow Substrait > >>> > > > > >>> > > > > However, the problem is there are natural cases in > >>> > > > > which an execution node should or must take in a string-name > >>> > > > > >>> > > > If we can come up with such a case then I agree it would be a > >>> problem > >>> > > > for Substrait's current definition. I don't think we can come up > >>> with > >>> > > > such a case. Every column that can be referenced by name has a > >>> unique > >>> > > > index we could use instead. > >>> > > > > >>> > > > > One use case is an execution node that is joining N > >>> > > > > input tables on a column-name that exists in all of them. > >>> > > > > >>> > > > Index based field references should still work here. For > example, > >>> if > >>> > > > I have tables: > >>> > > > > >>> > > > Key | LeftPayload -- Key | RightPayload > >>> > > > > >>> > > > The join expression would be field(0) == field(2) > >>> > > > > >>> > > > > The execution node would make itself more convenient > >>> > > > > for the user by allowing specifying a string-name than by > >>> > > > > specifying N FieldRef instances (each with the same > >>> > > > > name but on a different input table) like the above > >>> > > > > requirement would force > >>> > > > > >>> > > > Substrait plans aren't normally created by humans. I'm not sure > >>> > > > convenience is a factor here. > >>> > > > > >>> > > > > Another use case for a string-name is when it is used within > >>> > > > > the execution node to dynamically set up field-names, e.g., a > >>> > > > > node that operates on the input's columns whose name starts > >>> > > > > with the given string-name or a node that operates on an input > >>> > > > > column whose name is given as data in another input column. > >>> > > > > >>> > > > In both of those cases the field names are not part of the plan > >>> itself. > >>> > > > > >>> > > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <rt...@hotmail.com> > >>> wrote: > >>> > > > > > >>> > > > > Hi Weston, > >>> > > > > > >>> > > > > Thanks for the quick response. > >>> > > > > I think you might have forgotten the links for [1][2][3] > >>> > > > > Sorry about the confusion; I use these not as references to > >>> links but > >>> > > as > >>> > > > markers of points I make in the beginning that I elaborate on > >>> later, in > >>> > > the > >>> > > > places where I reuse the markers. > >>> > > > > Are you going from Substrait to an Arrow execution plan? > >>> > > > > Yes. > >>> > > > > For Substrait -> Arrow most of our execution nodes should take > >>> in a > >>> > > > FieldRef which can be a name but can also be index-based. So I > >>> wouldn't > >>> > > > expect Substrait's exclusive use of index-based references to be > an > >>> > > issue. > >>> > > > > Yes, I agree this is a design feature of Substrait and that if > >>> the > >>> > > > execution node takes in a FieldRef then there won't be a problem > >>> with > >>> > > > Substrait use of an index-based reference for it. If any > >>> > to-be-developed > >>> > > > execution node must indeed use a FieldRef when taking in a > >>> parameter > >>> > that > >>> > > > refers to columns, then this requirement should be documented for > >>> > > > developers of execution nodes. > >>> > > > > > >>> > > > > However, the problem is there are natural cases in which an > >>> execution > >>> > > > node should or must take in a string-name; these suffer from the > >>> above > >>> > > > design feature due to the non-natural field names that it leads > >>> to. One > >>> > > use > >>> > > > case is an execution node that is joining N input tables on a > >>> > column-name > >>> > > > that exists in all of them. The execution node would make itself > >>> more > >>> > > > convenient for the user by allowing specifying a string-name than > >>> by > >>> > > > specifying N FieldRef instances (each with the same name but on a > >>> > > different > >>> > > > input table) like the above requirement would force. Another use > >>> case > >>> > > for a > >>> > > > string-name is when it is used within the execution node to > >>> dynamically > >>> > > set > >>> > > > up field-names, e.g., a node that operates on the input's columns > >>> whose > >>> > > > name starts with the given string-name or a node that operates on > >>> an > >>> > > input > >>> > > > column whose name is given as data in another input column. > >>> > > > > > >>> > > > > This is the main reason we think there should be appropriate > >>> > (natural) > >>> > > > field names defined for each relation/node. > >>> > > > > Also keep in mind that Substrait isn't exactly a query language > >>> for > >>> > > > user's to be typing by hand. > >>> > > > > Yes, I took that into consideration. Namely, the above > discussion > >>> > > refers > >>> > > > to the scenario of a user writing an Ibis expression and the > >>> remaining > >>> > > > steps - Substrait compilation, serialization to protobuf,, > >>> > > deserialization > >>> > > > to an Arrow plan, and execution of the plan - are done > >>> systematically. > >>> > > > > > >>> > > > > > >>> > > > > Yaron. > >>> > > > > ________________________________ > >>> > > > > From: Weston Pace <weston.p...@gmail.com> > >>> > > > > Sent: Tuesday, April 19, 2022 1:01 PM > >>> > > > > To: dev@arrow.apache.org <dev@arrow.apache.org> > >>> > > > > Cc: Li Jin <ice.xell...@gmail.com> > >>> > > > > Subject: Re: [C++] output field names in Arrow Substrait > >>> > > > > > >>> > > > > Hi Yaron, I think you might have forgotten the links for > >>> [1][2][3] so > >>> > > > > I'm not entirely sure of the context. Are you going from > >>> Substrait > >>> > to > >>> > > > > an Arrow execution plan? Or are you going from an Arrow > >>> execution > >>> > > > > plan to Substrait? > >>> > > > > > >>> > > > > For Substrait -> Arrow most of our execution nodes should take > >>> in a > >>> > > > > FieldRef which can be a name but can also be index-based. So I > >>> > > > > wouldn't expect Substrait's exclusive use of index-based > >>> references > >>> > to > >>> > > > > be an issue. Also keep in mind that Substrait isn't exactly a > >>> query > >>> > > > > language for user's to be typing by hand. So, for example, if > a > >>> user > >>> > > > > wants a join and is using Ibis they could type: > >>> > > > > > >>> > > > > table.semi_join(s, table.x == table.y) > >>> > > > > > >>> > > > > Ibis is then responsible for converting "x" and "y" to the > >>> > appropriate > >>> > > > > indices (and it should have all the information needed to do > so). > >>> > The > >>> > > > > Substrait plan will refer to these nodes by index and the > >>> > > > > corresponding Arrow execution plan will use integer-based > >>> FieldRef. > >>> > > > > > >>> > > > > For Arrow -> Substrait then I agree that this could become a > >>> problem. > >>> > > > > Right now the Arrow -> Substrait path is mainly used for > internal > >>> > > > > testing with the hope that Substrait -> Arrow -> Substrait > should > >>> > > > > generally be possible (with zero loss of information). This > >>> does not > >>> > > > > mean that every Arrow plan will be convertible into Substrait. > >>> That > >>> > > > > is certainly a potential goal, and PRs to add that capability > >>> would > >>> > be > >>> > > > > welcome, but I don't know if anyone working on the > >>> Arrow/Substrait > >>> > > > > integration has that goal in mind. If that is your goal I > might > >>> be > >>> > > > > curious to learn more about your use cases. > >>> > > > > > >>> > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <rt...@hotmail.com > > > >>> > wrote: > >>> > > > > > > >>> > > > > > Hi, > >>> > > > > > > >>> > > > > > > >>> > > > > > We ran into an issue due to the fact that, for intermediate > >>> > > relations, > >>> > > > Substrait does not automatically compute output field names nor > >>> allows > >>> > > one > >>> > > > to explicitly name output fields [1]. This leads to trouble when > >>> one > >>> > > needs > >>> > > > to refer to these output fields by name [2]. We run into this > >>> trouble > >>> > > when > >>> > > > deserializing in Arrow [3] (based on commit 8e13c2dd). > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > One use case where [1] occurs is: > >>> > > > > > > >>> > > > > > import ibis > >>> > > > > > > >>> > > > > > import pyarrow as pa > >>> > > > > > > >>> > > > > > loaded_table = ibis.local_table(...) > >>> > > > > > > >>> > > > > > cast_table = ibis.mutate(loaded_table, > >>> > > > time=loaded_table.time.cast(pa.int64()) > >>> > > > > > > >>> > > > > > Even if loaded_table has nicely named fields, like "time" and > >>> > > "value1" > >>> > > > and "value2", the resulting cast_table (when deserialized in > >>> Arrow) has > >>> > > > inconvenient field names like "FieldPath(1)" and "FieldPath(2)" > >>> for all > >>> > > but > >>> > > > the "time" field. I believe that Substrait has all the > information > >>> to > >>> > > > automatically compute the field name "value*" for cast_table, but > >>> it > >>> > > > doesn't do this. Moreover, the caller has to know the schema of > >>> > > > loaded_table in order to set the output field names for > >>> cast_table, and > >>> > > > this is not convenient. When cast_table is an intermediate > relation > >>> > > (i.e., > >>> > > > not the root relation of the plan), Substrait also doesn't allow > >>> the > >>> > > caller > >>> > > > to explicitly name the output fields, and there is no place for > >>> these > >>> > > field > >>> > > > names in the Substrait protobuf. > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > One use case where [2] occurs is in a join-type operation > over > >>> > > > cast_table. The caller would normally like to use a field name > >>> (and not > >>> > > an > >>> > > > index) to refer to the join-key. Even if the caller knows the > field > >>> > name > >>> > > > for the join-key in loaded_table, many field names in cast_table > >>> (when > >>> > > > deserialized in Arrow) are different (and each includes an index > >>> in it) > >>> > > > than those of loaded_table. > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > The case of [3] occurs because Arrow deserializes > >>> ExecNodeOption > >>> > > > instances before it has Schema instances at hand. At this stage, > >>> > without > >>> > > > schemata, Arrow cannot compute field names that should be placed > >>> in a > >>> > > > ExecNodeOption that needs them, in particular ProjectNodeOptions. > >>> > > > Currently, Arrow creates an expression vector for the > >>> > ProjectNodeOptions > >>> > > > instance but leaves the names vector empty, and later defaults > each > >>> > name > >>> > > to > >>> > > > the ToString of each expression. > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > We'd like your input on this issue. Currently, we see two > >>> viable > >>> > ways > >>> > > > to resolve this issue: (1) add output field name for intermediate > >>> > > relations > >>> > > > in the Substrait plan so that Arrow can directly access them, or > >>> (2) > >>> > > > reproduce the field names in Arrow during deserialization by > >>> creating > >>> > > > Schema instances before ExecNodeOptions instances. In any case, > we > >>> > think > >>> > > > there should be appropriate field names defined for each > >>> relation/node. > >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > Note that due to this issue, any Arrow execution node that > >>> prompts > >>> > > the > >>> > > > user to refer to fields by name might not play nicely with > >>> Substrait. > >>> > > > > > > >>> > > > > > > >>> > > > > > Cheers, > >>> > > > > > Yaron. > >>> > > > > >>> > > > >>> > > >>> > >> >