Re: [C++] output field names in Arrow Substrait

Yaron Gvili Wed, 20 Apr 2022 09:18:19 -0700

> It'd be more precise to say that any relation that communicates a table to 
> something that isn't another Substrait relation should have an optional field 
> to specify the column names.


I think this can be automated, at least in most cases, assuming the 
(Substrait-consumer-side) reproduced column-names are natural/standardized. For 
example, in the cast_table example, the user would not need to specify "value1" 
and "value2" as output field-names for cast_table because these natural names 
would be the default.


Yaron.
________________________________
From: Jeroen van Straten <[email protected]>
Sent: Wednesday, April 20, 2022 11:54 AM
To: [email protected] <[email protected]>
Subject: Re: [C++] output field names in Arrow Substrait

> > The names at the relation root are a different story, since they specify
> > the column names for the produced table, but there's already a mechanism
> > for specifying those.

> I feel this is a bit limited because I don't think relation roots are the
> only thing that produces tables.

Ah, good point. Substrait doesn't have intermediate write relations yet
(at least not in protobuf), so I didn't think of that. It'd be more
precise to say that any relation that communicates a table to something
that isn't another Substrait relation should have an optional field to
specify the column names.

On Wed, 20 Apr 2022 at 17:42, Li Jin <[email protected]> wrote:

> Sorry for the many typos :(
>
> > That's why I think computing schema during serialization might be a
> better solution than going for index based in Arrow.
>
> To clarify, by this I meant keeping the substrait representation and in
> Arrow consumer, compute/construct schema/column names from subtrait field
> indices.
>
> On Wed, Apr 20, 2022 at 11:25 AM Li Jin <[email protected]> wrote:
>
> > >  I don't think Substratit cannot convince Spark to go index based.
> > Sorry I meant " I don't think Substrait can convince Spark to go index
> > based"
> >
> > On Wed, Apr 20, 2022 at 11:24 AM Li Jin <[email protected]> wrote:
> >
> >> >  A third alternative might be to adjust Arrow to move to field-index
> >> based
> >> execution instead of storing schemas in its relational operators.
> >>
> >> This is an interesting idea - I do think there are operators that need
> >> column name information (e.g. a "checkpoint" node that writes to
> external
> >> storage). In this case, it sounds like any operator that requires
> knowing
> >> the column name needs to store additional information for the column
> names
> >> in addition to the field index, in its substrait representation.
> >>
> >> > The names at the relation root are a different story, since they
> >> specify the column names for the produced table, but there's already a
> >> mechanism for specifying those.
> >>
> >> I feel this is a bit limited because I don't think relation roots are
> the
> >> only thing that produces tables. I can think of two examples in Spark
> that
> >> produce tables and are not a relation root. (1) A checkpoint node that
> >> produces a table to write to external storage (2) A Python UDF node that
> >> materializes a slice of the table and produces a pd.DataFrame to pass
> down
> >> to a user function. I think we will see these cases in Arrow as well. I
> >> feel it might make sense to have a subclass of the operator that
> requires
> >> column names in addition to just root rel.
> >>
> >> *On schema-based vs index-based**:*
> >>
> >> From personal experience working on Spark it would say that having
> schema
> >> available in each operator is useful for developers - I often print
> >> input/output rows and I feel not having the column names available might
> >> put a tax on the developer. (But maybe there are other ways to solve
> this).
> >>
> >> Are there other substrait consumers at the moment and how do they deal
> >> with this? For example, if we were to have Spark consumes substrait I'd
> >> imagine there would be similar problems because schema is part of Spark
> >> operators too and I don't think Substratit cannot convince Spark to go
> >> index based.
> >> I'd imagine during deserialization of the subtrait plan to Spark it
> would
> >> need to compute a schema object for each operator. That's why I think
> >> computing schema during serialization might be a better solution than
> going
> >> for index based in Arrow.
> >>
> >>
> >> On Wed, Apr 20, 2022 at 11:14 AM Phillip Cloud <[email protected]>
> wrote:
> >>
> >>> On Wed, Apr 20, 2022, 11:31 Jeroen van Straten <
> >>> [email protected]>
> >>> wrote:
> >>>
> >>> > From a Substrait perspective, it would be up to Ibis to convert the
> >>> > column names to the correct indices. This can and should be
> transparent
> >>> > to the end user; front-ends already need to know what the schemas are
> >>> > in order to produce a valid Substrait plan, so Ibis should have all
> the
> >>> > information it needs to do this conversion.
> >>> >
> >>> > I would, however, agree that it's inconsistent for Substrait to
> require
> >>> > that column names are specified for read relations, but not for the
> >>> > other relations. AFAICT nothing on the consumer side can or should
> make
> >>> > use of those names. The only use case I can think of is to aid
> >>> > debugging; if that's the only reason they exist, it should indeed be
> >>> > possible to specify the names for each relation, but it should be in
> >>> > the form of as optional, non-functional hints in RelCommon. The names
> >>> > at the relation root are a different story, since they specify the
> >>> > column names for the produced table, but there's already a mechanism
> >>> > for specifying those. No idea if Arrow and Ibis use it (since it's
> >>> > optional), but Substrait supports it.
> >>> >
> >>>
> >>> ReadRels having column names is probably not strictly necessary for all
> >>> the
> >>> ReadRels, but for NamedTable it's necessary. For example, if I wanted
> to
> >>> represent a ReadRel independent of storage, that an API could
> deserialize
> >>> for metadata, and present that metadata to the user then you need
> column
> >>> names. Ibis uses the column names from ReadRels to turn Substrait back
> >>> into
> >>> expressions.
> >>>
> >>> The RelRoot is very much used in ibis. Every table expression compiles
> >>> to a
> >>> RelRoot.
> >>>
> >>>
> >>> > If there *are* special cases I'm not seeing, where the execution
> engine
> >>> > really does need the names of intermediate columns, I suppose it
> would
> >>> > be up to the engine to figure out what they would be. The information
> >>> > could also be added via Substrait "advanced extensions," especially
> >>> > when a user overrides a name, but it wouldn't be a good idea for the
> >>> > consumer to rely on the existance of these extensions, for
> >>> > compatibility with other frameworks.
> >>>
> >>>
> >>> > If said special cases turn out to not be so special, a valid case
> could
> >>> > be made to pull that extension into Substrait itself. If they're only
> >>> > attached wherever the user manually overrides a name, I think the
> size
> >>> > explosion Phillip is talking about would be avoided. I *really* don't
> >>> > see why anyone would want to match *generated* names in any
> functional
> >>> > way; that's a recipe for undefined behavior.
> >>> >
> >>> > None of Substrait's built-in things make use of column names though,
> >>> > which implies that any Arrow execution node that *does* need it
> cannot
> >>> > be described with Substrait, and thus should never appear in an
> >>> > execution plan deserialized from Substrait. If it currently does,
> >>> > that's an issue with Arrow's Substrait consumer. If we'd want to
> >>> > correctly represent those kinds of nodes anyway, we'd already be
> >>> > dealing with an extension relation, to which we could attach whatever
> >>> > information we want.
> >>> >
> >>> > On Wed, 20 Apr 2022 at 16:05, Phillip Cloud <[email protected]>
> wrote:
> >>> >
> >>> > > A third alternative might be to adjust Arrow to move to field-index
> >>> based
> >>> > > execution instead of storing schemas in its relational operators.
> >>> > >
> >>> > > I suspect that deserializing the schema on the arrow side is the
> >>> right
> >>> > > solution in at least the near term. Repeating field names where
> they
> >>> > aren't
> >>> > > strictly necessary has a tendency to enlarge plans, and the effect
> >>> gets
> >>> > > worse as plans go through transformations.
> >>> > >
> >>> > > On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <[email protected]>
> >>> wrote:
> >>> > >
> >>> > > > Hi Weston,
> >>> > > >
> >>> > > > To reiterate, the scenario is not a user writing a Substrait plan
> >>> by
> >>> > > hand,
> >>> > > > but a user writing an Ibis expression and the remaining steps -
> >>> > Substrait
> >>> > > > compilation, serialization to protobuf, deserialization to an
> Arrow
> >>> > plan,
> >>> > > > and execution of the plan - are done systematically. This issue
> is
> >>> > > related
> >>> > > > to Ibis, Ibis-Substrait, and Arrow together, so multiple persons
> >>> may
> >>> > need
> >>> > > > to bear on this issue.
> >>> > > > Index based field references should still work here.  For
> example,
> >>> if
> >>> > > > I have tables:
> >>> > > >
> >>> > > > Key | LeftPayload    --    Key | RightPayload
> >>> > > >
> >>> > > > The join expression would be field(0) == field(2)
> >>> > > > Right, but the point in this use case is that it would not be
> >>> > convenient
> >>> > > > for a user writing an Ibis expression to specify each key of each
> >>> > table,
> >>> > > > especially when plenty of (say, 100) tables are being joined. It
> >>> would
> >>> > be
> >>> > > > much more convenient for the user to specify once the name of the
> >>> key
> >>> > > that
> >>> > > > exists in all tables. This is a use case where the key should
> (not
> >>> > must)
> >>> > > be
> >>> > > > specified by name for convenience.
> >>> > > > In both of those cases the field names are not part of the plan
> >>> itself.
> >>> > > > In the second use case, the user writing the ibis expression is
> >>> > > specifying
> >>> > > > a string-name as a parameter to a relation that would later get
> >>> passed
> >>> > to
> >>> > > > an Arrow execution node in an options instance and used to
> >>> dynamically
> >>> > > set
> >>> > > > up field-names. This string-name does in fact appear explicitly
> in
> >>> the
> >>> > > > Substrait plan, not for convenience but of necessity. What does
> not
> >>> > > appear
> >>> > > > in the Substrait plan are output field-names of intermediate
> >>> relations;
> >>> > > > this in itself is not a problem, because these field-names can
> (in
> >>> > > > principle) be recomputed to allow them to be matched to the
> >>> dynamically
> >>> > > > set-up field-names. But the current implementation of Arrow (in
> its
> >>> > > > Substrait module) does not do this because the schemata are not
> >>> > available
> >>> > > > during deserialization, when the option instances (such as
> >>> > > > ProjectNodeOptions) that require the schemata are created.
> Instead,
> >>> > Arrow
> >>> > > > (in its Substrait module) ends up with non-natural field-names
> like
> >>> > > > "FieldPath(1)" that fail to be matched.
> >>> > > >
> >>> > > > That's why this is not a Substrait specific problem, but one that
> >>> is
> >>> > > > related to Ibis, Ibis-Substrait, and Arrow together. We think it
> >>> can be
> >>> > > > resolved either by changing Substrait to assist Arrow by
> >>> specifying the
> >>> > > > field-names of intermediate relations in the plan, so that they
> are
> >>> > > readily
> >>> > > > available during deserialization, or by changing Arrow (in its
> >>> > Substrait
> >>> > > > module) to reproduce the field-names during deserialization, like
> >>> by
> >>> > > > computing the schemata before options and node instances are
> >>> created.
> >>> > We
> >>> > > > might come up with other solutions in this discussion, of course.
> >>> > Either
> >>> > > > way, for this second use case, the field-names of intermediate
> >>> > relations
> >>> > > > must be natural; they cannot be left as something like
> >>> "FieldPath(1)".
> >>> > > >
> >>> > > >
> >>> > > > Yaron.
> >>> > > > ________________________________
> >>> > > > From: Weston Pace <[email protected]>
> >>> > > > Sent: Tuesday, April 19, 2022 7:12 PM
> >>> > > > To: [email protected] <[email protected]>
> >>> > > > Cc: Li Jin <[email protected]>
> >>> > > > Subject: Re: [C++] output field names in Arrow Substrait
> >>> > > >
> >>> > > > > However, the problem is there are natural cases in
> >>> > > > > which an execution node should or must take in a string-name
> >>> > > >
> >>> > > > If we can come up with such a case then I agree it would be a
> >>> problem
> >>> > > > for Substrait's current definition.  I don't think we can come up
> >>> with
> >>> > > > such a case.  Every column that can be referenced by name has a
> >>> unique
> >>> > > > index we could use instead.
> >>> > > >
> >>> > > > > One use case is an execution node that is joining N
> >>> > > > > input tables on a column-name that exists in all of them.
> >>> > > >
> >>> > > > Index based field references should still work here.  For
> example,
> >>> if
> >>> > > > I have tables:
> >>> > > >
> >>> > > > Key | LeftPayload    --    Key | RightPayload
> >>> > > >
> >>> > > > The join expression would be field(0) == field(2)
> >>> > > >
> >>> > > > > The execution node would make itself more convenient
> >>> > > > > for the user by allowing specifying a string-name than by
> >>> > > > > specifying N FieldRef instances (each with the same
> >>> > > > > name but on a different input table) like the above
> >>> > > > > requirement would force
> >>> > > >
> >>> > > > Substrait plans aren't normally created by humans.  I'm not sure
> >>> > > > convenience is a factor here.
> >>> > > >
> >>> > > > > Another use case for a string-name is when it is used within
> >>> > > > > the execution node to dynamically set up field-names, e.g., a
> >>> > > > > node that operates on the input's columns whose name starts
> >>> > > > > with the given string-name or a node that operates on an input
> >>> > > > > column whose name is given as data in another input column.
> >>> > > >
> >>> > > > In both of those cases the field names are not part of the plan
> >>> itself.
> >>> > > >
> >>> > > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <[email protected]>
> >>> wrote:
> >>> > > > >
> >>> > > > > Hi Weston,
> >>> > > > >
> >>> > > > > Thanks for the quick response.
> >>> > > > > I think you might have forgotten the links for [1][2][3]
> >>> > > > > Sorry about the confusion; I use these not as references to
> >>> links but
> >>> > > as
> >>> > > > markers of points I make in the beginning that I elaborate on
> >>> later, in
> >>> > > the
> >>> > > > places where I reuse the markers.
> >>> > > > > Are you going from Substrait to an Arrow execution plan?
> >>> > > > > Yes.
> >>> > > > > For Substrait -> Arrow most of our execution nodes should take
> >>> in a
> >>> > > > FieldRef which can be a name but can also be index-based. So I
> >>> wouldn't
> >>> > > > expect Substrait's exclusive use of index-based references to be
> an
> >>> > > issue.
> >>> > > > > Yes, I agree this is a design feature of Substrait and that if
> >>> the
> >>> > > > execution node takes in a FieldRef then there won't be a problem
> >>> with
> >>> > > > Substrait use of an index-based reference for it. If any
> >>> > to-be-developed
> >>> > > > execution node must indeed use a FieldRef when taking in a
> >>> parameter
> >>> > that
> >>> > > > refers to columns, then this requirement should be documented for
> >>> > > > developers of execution nodes.
> >>> > > > >
> >>> > > > > However, the problem is there are natural cases in which an
> >>> execution
> >>> > > > node should or must take in a string-name; these suffer from the
> >>> above
> >>> > > > design feature due to the non-natural field names that it leads
> >>> to. One
> >>> > > use
> >>> > > > case is an execution node that is joining N input tables on a
> >>> > column-name
> >>> > > > that exists in all of them. The execution node would make itself
> >>> more
> >>> > > > convenient for the user by allowing specifying a string-name than
> >>> by
> >>> > > > specifying N FieldRef instances (each with the same name but on a
> >>> > > different
> >>> > > > input table) like the above requirement would force. Another use
> >>> case
> >>> > > for a
> >>> > > > string-name is when it is used within the execution node to
> >>> dynamically
> >>> > > set
> >>> > > > up field-names, e.g., a node that operates on the input's columns
> >>> whose
> >>> > > > name starts with the given string-name or a node that operates on
> >>> an
> >>> > > input
> >>> > > > column whose name is given as data in another input column.
> >>> > > > >
> >>> > > > > This is the main reason we think there should be appropriate
> >>> > (natural)
> >>> > > > field names defined for each relation/node.
> >>> > > > > Also keep in mind that Substrait isn't exactly a query language
> >>> for
> >>> > > > user's to be typing by hand.
> >>> > > > > Yes, I took that into consideration. Namely, the above
> discussion
> >>> > > refers
> >>> > > > to the scenario of a user writing an Ibis expression and the
> >>> remaining
> >>> > > > steps - Substrait compilation, serialization to protobuf,,
> >>> > > deserialization
> >>> > > > to an Arrow plan, and execution of the plan - are done
> >>> systematically.
> >>> > > > >
> >>> > > > >
> >>> > > > > Yaron.
> >>> > > > > ________________________________
> >>> > > > > From: Weston Pace <[email protected]>
> >>> > > > > Sent: Tuesday, April 19, 2022 1:01 PM
> >>> > > > > To: [email protected] <[email protected]>
> >>> > > > > Cc: Li Jin <[email protected]>
> >>> > > > > Subject: Re: [C++] output field names in Arrow Substrait
> >>> > > > >
> >>> > > > > Hi Yaron, I think you might have forgotten the links for
> >>> [1][2][3] so
> >>> > > > > I'm not entirely sure of the context.  Are you going from
> >>> Substrait
> >>> > to
> >>> > > > > an Arrow execution plan?  Or are you going from an Arrow
> >>> execution
> >>> > > > > plan to Substrait?
> >>> > > > >
> >>> > > > > For Substrait -> Arrow most of our execution nodes should take
> >>> in a
> >>> > > > > FieldRef which can be a name but can also be index-based.  So I
> >>> > > > > wouldn't expect Substrait's exclusive use of index-based
> >>> references
> >>> > to
> >>> > > > > be an issue.  Also keep in mind that Substrait isn't exactly a
> >>> query
> >>> > > > > language for user's to be typing by hand.  So, for example, if
> a
> >>> user
> >>> > > > > wants a join and is using Ibis they could type:
> >>> > > > >
> >>> > > > > table.semi_join(s, table.x == table.y)
> >>> > > > >
> >>> > > > > Ibis is then responsible for converting "x" and "y" to the
> >>> > appropriate
> >>> > > > > indices (and it should have all the information needed to do
> so).
> >>> > The
> >>> > > > > Substrait plan will refer to these nodes by index and the
> >>> > > > > corresponding Arrow execution plan will use integer-based
> >>> FieldRef.
> >>> > > > >
> >>> > > > > For Arrow -> Substrait then I agree that this could become a
> >>> problem.
> >>> > > > > Right now the Arrow -> Substrait path is mainly used for
> internal
> >>> > > > > testing with the hope that Substrait -> Arrow -> Substrait
> should
> >>> > > > > generally be possible (with zero loss of information).  This
> >>> does not
> >>> > > > > mean that every Arrow plan will be convertible into Substrait.
> >>> That
> >>> > > > > is certainly a potential goal, and PRs to add that capability
> >>> would
> >>> > be
> >>> > > > > welcome, but I don't know if anyone working on the
> >>> Arrow/Substrait
> >>> > > > > integration has that goal in mind.  If that is your goal I
> might
> >>> be
> >>> > > > > curious to learn more about your use cases.
> >>> > > > >
> >>> > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <[email protected]
> >
> >>> > wrote:
> >>> > > > > >
> >>> > > > > > Hi,
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > We ran into an issue due to the fact that, for intermediate
> >>> > > relations,
> >>> > > > Substrait does not automatically compute output field names nor
> >>> allows
> >>> > > one
> >>> > > > to explicitly name output fields [1]. This leads to trouble when
> >>> one
> >>> > > needs
> >>> > > > to refer to these output fields by name [2]. We run into this
> >>> trouble
> >>> > > when
> >>> > > > deserializing in Arrow [3] (based on commit 8e13c2dd).
> >>> > > > > >
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > One use case where [1] occurs is:
> >>> > > > > >
> >>> > > > > > import ibis
> >>> > > > > >
> >>> > > > > > import pyarrow as pa
> >>> > > > > >
> >>> > > > > > loaded_table = ibis.local_table(...)
> >>> > > > > >
> >>> > > > > > cast_table = ibis.mutate(loaded_table,
> >>> > > > time=loaded_table.time.cast(pa.int64())
> >>> > > > > >
> >>> > > > > > Even if loaded_table has nicely named fields, like "time" and
> >>> > > "value1"
> >>> > > > and "value2", the resulting cast_table (when deserialized in
> >>> Arrow) has
> >>> > > > inconvenient field names like "FieldPath(1)" and "FieldPath(2)"
> >>> for all
> >>> > > but
> >>> > > > the "time" field. I believe that Substrait has all the
> information
> >>> to
> >>> > > > automatically compute the field name "value*" for cast_table, but
> >>> it
> >>> > > > doesn't do this. Moreover, the caller has to know the schema of
> >>> > > > loaded_table in order to set the output field names for
> >>> cast_table, and
> >>> > > > this is not convenient. When cast_table is an intermediate
> relation
> >>> > > (i.e.,
> >>> > > > not the root relation of the plan), Substrait also doesn't allow
> >>> the
> >>> > > caller
> >>> > > > to explicitly name the output fields, and there is no place for
> >>> these
> >>> > > field
> >>> > > > names in the Substrait protobuf.
> >>> > > > > >
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > One use case where [2] occurs is in a join-type operation
> over
> >>> > > > cast_table. The caller would normally like to use a field name
> >>> (and not
> >>> > > an
> >>> > > > index) to refer to the join-key. Even if the caller knows the
> field
> >>> > name
> >>> > > > for the join-key in loaded_table, many field names in cast_table
> >>> (when
> >>> > > > deserialized in Arrow) are different (and each includes an index
> >>> in it)
> >>> > > > than those of loaded_table.
> >>> > > > > >
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > The case of [3] occurs because Arrow deserializes
> >>> ExecNodeOption
> >>> > > > instances before it has Schema instances at hand. At this stage,
> >>> > without
> >>> > > > schemata, Arrow cannot compute field names that should be placed
> >>> in a
> >>> > > > ExecNodeOption that needs them, in particular ProjectNodeOptions.
> >>> > > > Currently, Arrow creates an expression vector for the
> >>> > ProjectNodeOptions
> >>> > > > instance but leaves the names vector empty, and later defaults
> each
> >>> > name
> >>> > > to
> >>> > > > the ToString of each expression.
> >>> > > > > >
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > We'd like your input on this issue. Currently, we see two
> >>> viable
> >>> > ways
> >>> > > > to resolve this issue: (1) add output field name for intermediate
> >>> > > relations
> >>> > > > in the Substrait plan so that Arrow can directly access them, or
> >>> (2)
> >>> > > > reproduce the field names in Arrow during deserialization by
> >>> creating
> >>> > > > Schema instances before ExecNodeOptions instances. In any case,
> we
> >>> > think
> >>> > > > there should be appropriate field names defined for each
> >>> relation/node.
> >>> > > > > >
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > Note that due to this issue, any Arrow execution node that
> >>> prompts
> >>> > > the
> >>> > > > user to refer to fields by name might not play nicely with
> >>> Substrait.
> >>> > > > > >
> >>> > > > > >
> >>> > > > > > Cheers,
> >>> > > > > > Yaron.
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>
>

Re: [C++] output field names in Arrow Substrait

Reply via email to