Re: [C++] output field names in Arrow Substrait

Jacques Nadeau Sat, 23 Apr 2022 07:09:07 -0700

In the specification, there are both read and intermediate write rels. No
one has implemented the protobuf yet for write. Both carry field names.


The names of fields is an internal rel node concern just like condition is
for filter. This is because many formats require names. For example,
parquet and avro are frequently queried by name. That being said, I agree
that it should only be required if the format requires it.

I plan to write up a faq for this and other common questions on the
Substrait site.

On Wed, Apr 20, 2022, 12:32 PM Weston Pace <weston.p...@gmail.com> wrote:

> In the short term we would have a practical limitation moving Arrow's
> implementation to an index-only scheme.  The pyarrow and R
> implementations currently go directly to Arrow's internal
> representation and rely on Arrow's ability to resolve named field
> references in project/join/filter (which requires us to at least know
> field names at every node, even if we don't store them in the node
> itself).
>
> If pyarrow moved to Substrait for these APIs (I have no idea if anyone
> is thinking of doing that) then I suppose they would have to solve
> that problem anyways.
>
> > Ah, good point. Substrait doesn't have intermediate write relations yet
> > (at least not in protobuf), so I didn't think of that. It'd be more
> > precise to say that any relation that communicates a table to something
> > that isn't another Substrait relation should have an optional field to
> > specify the column names.
>
> Agreed.  In my mind, anytime we leave the plan, there is a sort of
> "serialization" step that has to happen where we go from "abstract
> data stream" to something more concrete.  For example, we might want
> to be able to specify field names, encodings, compression, batch
> sizes, etc.  Stuff that may or may not exist inside the plan but the
> producer doesn't have to care about until they actually start to think
> about how they receive the data.  Even when there isn't a write node I
> think this still happens, it just happens outside of the plan.  In
> Arrow this is the responsibility of the "sink node".  I wonder if we
> might want some equivalent in Substrait someday.
>
> On Wed, Apr 20, 2022 at 6:18 AM Yaron Gvili <rt...@hotmail.com> wrote:
> >
> > > It'd be more precise to say that any relation that communicates a
> table to something that isn't another Substrait relation should have an
> optional field to specify the column names.
> >
> > I think this can be automated, at least in most cases, assuming the
> (Substrait-consumer-side) reproduced column-names are natural/standardized.
> For example, in the cast_table example, the user would not need to specify
> "value1" and "value2" as output field-names for cast_table because these
> natural names would be the default.
> >
> >
> > Yaron.
> > ________________________________
> > From: Jeroen van Straten <jeroen.van.stra...@gmail.com>
> > Sent: Wednesday, April 20, 2022 11:54 AM
> > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > Subject: Re: [C++] output field names in Arrow Substrait
> >
> > > > The names at the relation root are a different story, since they
> specify
> > > > the column names for the produced table, but there's already a
> mechanism
> > > > for specifying those.
> >
> > > I feel this is a bit limited because I don't think relation roots are
> the
> > > only thing that produces tables.
> >
> > Ah, good point. Substrait doesn't have intermediate write relations yet
> > (at least not in protobuf), so I didn't think of that. It'd be more
> > precise to say that any relation that communicates a table to something
> > that isn't another Substrait relation should have an optional field to
> > specify the column names.
> >
> > On Wed, 20 Apr 2022 at 17:42, Li Jin <ice.xell...@gmail.com> wrote:
> >
> > > Sorry for the many typos :(
> > >
> > > > That's why I think computing schema during serialization might be a
> > > better solution than going for index based in Arrow.
> > >
> > > To clarify, by this I meant keeping the substrait representation and in
> > > Arrow consumer, compute/construct schema/column names from subtrait
> field
> > > indices.
> > >
> > > On Wed, Apr 20, 2022 at 11:25 AM Li Jin <ice.xell...@gmail.com> wrote:
> > >
> > > > >  I don't think Substratit cannot convince Spark to go index based.
> > > > Sorry I meant " I don't think Substrait can convince Spark to go
> index
> > > > based"
> > > >
> > > > On Wed, Apr 20, 2022 at 11:24 AM Li Jin <ice.xell...@gmail.com>
> wrote:
> > > >
> > > >> >  A third alternative might be to adjust Arrow to move to
> field-index
> > > >> based
> > > >> execution instead of storing schemas in its relational operators.
> > > >>
> > > >> This is an interesting idea - I do think there are operators that
> need
> > > >> column name information (e.g. a "checkpoint" node that writes to
> > > external
> > > >> storage). In this case, it sounds like any operator that requires
> > > knowing
> > > >> the column name needs to store additional information for the column
> > > names
> > > >> in addition to the field index, in its substrait representation.
> > > >>
> > > >> > The names at the relation root are a different story, since they
> > > >> specify the column names for the produced table, but there's
> already a
> > > >> mechanism for specifying those.
> > > >>
> > > >> I feel this is a bit limited because I don't think relation roots
> are
> > > the
> > > >> only thing that produces tables. I can think of two examples in
> Spark
> > > that
> > > >> produce tables and are not a relation root. (1) A checkpoint node
> that
> > > >> produces a table to write to external storage (2) A Python UDF node
> that
> > > >> materializes a slice of the table and produces a pd.DataFrame to
> pass
> > > down
> > > >> to a user function. I think we will see these cases in Arrow as
> well. I
> > > >> feel it might make sense to have a subclass of the operator that
> > > requires
> > > >> column names in addition to just root rel.
> > > >>
> > > >> *On schema-based vs index-based**:*
> > > >>
> > > >> From personal experience working on Spark it would say that having
> > > schema
> > > >> available in each operator is useful for developers - I often print
> > > >> input/output rows and I feel not having the column names available
> might
> > > >> put a tax on the developer. (But maybe there are other ways to solve
> > > this).
> > > >>
> > > >> Are there other substrait consumers at the moment and how do they
> deal
> > > >> with this? For example, if we were to have Spark consumes substrait
> I'd
> > > >> imagine there would be similar problems because schema is part of
> Spark
> > > >> operators too and I don't think Substratit cannot convince Spark to
> go
> > > >> index based.
> > > >> I'd imagine during deserialization of the subtrait plan to Spark it
> > > would
> > > >> need to compute a schema object for each operator. That's why I
> think
> > > >> computing schema during serialization might be a better solution
> than
> > > going
> > > >> for index based in Arrow.
> > > >>
> > > >>
> > > >> On Wed, Apr 20, 2022 at 11:14 AM Phillip Cloud <cpcl...@gmail.com>
> > > wrote:
> > > >>
> > > >>> On Wed, Apr 20, 2022, 11:31 Jeroen van Straten <
> > > >>> jeroen.van.stra...@gmail.com>
> > > >>> wrote:
> > > >>>
> > > >>> > From a Substrait perspective, it would be up to Ibis to convert
> the
> > > >>> > column names to the correct indices. This can and should be
> > > transparent
> > > >>> > to the end user; front-ends already need to know what the
> schemas are
> > > >>> > in order to produce a valid Substrait plan, so Ibis should have
> all
> > > the
> > > >>> > information it needs to do this conversion.
> > > >>> >
> > > >>> > I would, however, agree that it's inconsistent for Substrait to
> > > require
> > > >>> > that column names are specified for read relations, but not for
> the
> > > >>> > other relations. AFAICT nothing on the consumer side can or
> should
> > > make
> > > >>> > use of those names. The only use case I can think of is to aid
> > > >>> > debugging; if that's the only reason they exist, it should
> indeed be
> > > >>> > possible to specify the names for each relation, but it should
> be in
> > > >>> > the form of as optional, non-functional hints in RelCommon. The
> names
> > > >>> > at the relation root are a different story, since they specify
> the
> > > >>> > column names for the produced table, but there's already a
> mechanism
> > > >>> > for specifying those. No idea if Arrow and Ibis use it (since
> it's
> > > >>> > optional), but Substrait supports it.
> > > >>> >
> > > >>>
> > > >>> ReadRels having column names is probably not strictly necessary
> for all
> > > >>> the
> > > >>> ReadRels, but for NamedTable it's necessary. For example, if I
> wanted
> > > to
> > > >>> represent a ReadRel independent of storage, that an API could
> > > deserialize
> > > >>> for metadata, and present that metadata to the user then you need
> > > column
> > > >>> names. Ibis uses the column names from ReadRels to turn Substrait
> back
> > > >>> into
> > > >>> expressions.
> > > >>>
> > > >>> The RelRoot is very much used in ibis. Every table expression
> compiles
> > > >>> to a
> > > >>> RelRoot.
> > > >>>
> > > >>>
> > > >>> > If there *are* special cases I'm not seeing, where the execution
> > > engine
> > > >>> > really does need the names of intermediate columns, I suppose it
> > > would
> > > >>> > be up to the engine to figure out what they would be. The
> information
> > > >>> > could also be added via Substrait "advanced extensions,"
> especially
> > > >>> > when a user overrides a name, but it wouldn't be a good idea for
> the
> > > >>> > consumer to rely on the existance of these extensions, for
> > > >>> > compatibility with other frameworks.
> > > >>>
> > > >>>
> > > >>> > If said special cases turn out to not be so special, a valid case
> > > could
> > > >>> > be made to pull that extension into Substrait itself. If they're
> only
> > > >>> > attached wherever the user manually overrides a name, I think the
> > > size
> > > >>> > explosion Phillip is talking about would be avoided. I *really*
> don't
> > > >>> > see why anyone would want to match *generated* names in any
> > > functional
> > > >>> > way; that's a recipe for undefined behavior.
> > > >>> >
> > > >>> > None of Substrait's built-in things make use of column names
> though,
> > > >>> > which implies that any Arrow execution node that *does* need it
> > > cannot
> > > >>> > be described with Substrait, and thus should never appear in an
> > > >>> > execution plan deserialized from Substrait. If it currently does,
> > > >>> > that's an issue with Arrow's Substrait consumer. If we'd want to
> > > >>> > correctly represent those kinds of nodes anyway, we'd already be
> > > >>> > dealing with an extension relation, to which we could attach
> whatever
> > > >>> > information we want.
> > > >>> >
> > > >>> > On Wed, 20 Apr 2022 at 16:05, Phillip Cloud <cpcl...@gmail.com>
> > > wrote:
> > > >>> >
> > > >>> > > A third alternative might be to adjust Arrow to move to
> field-index
> > > >>> based
> > > >>> > > execution instead of storing schemas in its relational
> operators.
> > > >>> > >
> > > >>> > > I suspect that deserializing the schema on the arrow side is
> the
> > > >>> right
> > > >>> > > solution in at least the near term. Repeating field names where
> > > they
> > > >>> > aren't
> > > >>> > > strictly necessary has a tendency to enlarge plans, and the
> effect
> > > >>> gets
> > > >>> > > worse as plans go through transformations.
> > > >>> > >
> > > >>> > > On Wed, Apr 20, 2022 at 6:08 AM Yaron Gvili <rt...@hotmail.com
> >
> > > >>> wrote:
> > > >>> > >
> > > >>> > > > Hi Weston,
> > > >>> > > >
> > > >>> > > > To reiterate, the scenario is not a user writing a Substrait
> plan
> > > >>> by
> > > >>> > > hand,
> > > >>> > > > but a user writing an Ibis expression and the remaining
> steps -
> > > >>> > Substrait
> > > >>> > > > compilation, serialization to protobuf, deserialization to an
> > > Arrow
> > > >>> > plan,
> > > >>> > > > and execution of the plan - are done systematically. This
> issue
> > > is
> > > >>> > > related
> > > >>> > > > to Ibis, Ibis-Substrait, and Arrow together, so multiple
> persons
> > > >>> may
> > > >>> > need
> > > >>> > > > to bear on this issue.
> > > >>> > > > Index based field references should still work here.  For
> > > example,
> > > >>> if
> > > >>> > > > I have tables:
> > > >>> > > >
> > > >>> > > > Key | LeftPayload    --    Key | RightPayload
> > > >>> > > >
> > > >>> > > > The join expression would be field(0) == field(2)
> > > >>> > > > Right, but the point in this use case is that it would not be
> > > >>> > convenient
> > > >>> > > > for a user writing an Ibis expression to specify each key of
> each
> > > >>> > table,
> > > >>> > > > especially when plenty of (say, 100) tables are being
> joined. It
> > > >>> would
> > > >>> > be
> > > >>> > > > much more convenient for the user to specify once the name
> of the
> > > >>> key
> > > >>> > > that
> > > >>> > > > exists in all tables. This is a use case where the key should
> > > (not
> > > >>> > must)
> > > >>> > > be
> > > >>> > > > specified by name for convenience.
> > > >>> > > > In both of those cases the field names are not part of the
> plan
> > > >>> itself.
> > > >>> > > > In the second use case, the user writing the ibis expression
> is
> > > >>> > > specifying
> > > >>> > > > a string-name as a parameter to a relation that would later
> get
> > > >>> passed
> > > >>> > to
> > > >>> > > > an Arrow execution node in an options instance and used to
> > > >>> dynamically
> > > >>> > > set
> > > >>> > > > up field-names. This string-name does in fact appear
> explicitly
> > > in
> > > >>> the
> > > >>> > > > Substrait plan, not for convenience but of necessity. What
> does
> > > not
> > > >>> > > appear
> > > >>> > > > in the Substrait plan are output field-names of intermediate
> > > >>> relations;
> > > >>> > > > this in itself is not a problem, because these field-names
> can
> > > (in
> > > >>> > > > principle) be recomputed to allow them to be matched to the
> > > >>> dynamically
> > > >>> > > > set-up field-names. But the current implementation of Arrow
> (in
> > > its
> > > >>> > > > Substrait module) does not do this because the schemata are
> not
> > > >>> > available
> > > >>> > > > during deserialization, when the option instances (such as
> > > >>> > > > ProjectNodeOptions) that require the schemata are created.
> > > Instead,
> > > >>> > Arrow
> > > >>> > > > (in its Substrait module) ends up with non-natural
> field-names
> > > like
> > > >>> > > > "FieldPath(1)" that fail to be matched.
> > > >>> > > >
> > > >>> > > > That's why this is not a Substrait specific problem, but one
> that
> > > >>> is
> > > >>> > > > related to Ibis, Ibis-Substrait, and Arrow together. We
> think it
> > > >>> can be
> > > >>> > > > resolved either by changing Substrait to assist Arrow by
> > > >>> specifying the
> > > >>> > > > field-names of intermediate relations in the plan, so that
> they
> > > are
> > > >>> > > readily
> > > >>> > > > available during deserialization, or by changing Arrow (in
> its
> > > >>> > Substrait
> > > >>> > > > module) to reproduce the field-names during deserialization,
> like
> > > >>> by
> > > >>> > > > computing the schemata before options and node instances are
> > > >>> created.
> > > >>> > We
> > > >>> > > > might come up with other solutions in this discussion, of
> course.
> > > >>> > Either
> > > >>> > > > way, for this second use case, the field-names of
> intermediate
> > > >>> > relations
> > > >>> > > > must be natural; they cannot be left as something like
> > > >>> "FieldPath(1)".
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > Yaron.
> > > >>> > > > ________________________________
> > > >>> > > > From: Weston Pace <weston.p...@gmail.com>
> > > >>> > > > Sent: Tuesday, April 19, 2022 7:12 PM
> > > >>> > > > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > > >>> > > > Cc: Li Jin <ice.xell...@gmail.com>
> > > >>> > > > Subject: Re: [C++] output field names in Arrow Substrait
> > > >>> > > >
> > > >>> > > > > However, the problem is there are natural cases in
> > > >>> > > > > which an execution node should or must take in a
> string-name
> > > >>> > > >
> > > >>> > > > If we can come up with such a case then I agree it would be a
> > > >>> problem
> > > >>> > > > for Substrait's current definition.  I don't think we can
> come up
> > > >>> with
> > > >>> > > > such a case.  Every column that can be referenced by name
> has a
> > > >>> unique
> > > >>> > > > index we could use instead.
> > > >>> > > >
> > > >>> > > > > One use case is an execution node that is joining N
> > > >>> > > > > input tables on a column-name that exists in all of them.
> > > >>> > > >
> > > >>> > > > Index based field references should still work here.  For
> > > example,
> > > >>> if
> > > >>> > > > I have tables:
> > > >>> > > >
> > > >>> > > > Key | LeftPayload    --    Key | RightPayload
> > > >>> > > >
> > > >>> > > > The join expression would be field(0) == field(2)
> > > >>> > > >
> > > >>> > > > > The execution node would make itself more convenient
> > > >>> > > > > for the user by allowing specifying a string-name than by
> > > >>> > > > > specifying N FieldRef instances (each with the same
> > > >>> > > > > name but on a different input table) like the above
> > > >>> > > > > requirement would force
> > > >>> > > >
> > > >>> > > > Substrait plans aren't normally created by humans.  I'm not
> sure
> > > >>> > > > convenience is a factor here.
> > > >>> > > >
> > > >>> > > > > Another use case for a string-name is when it is used
> within
> > > >>> > > > > the execution node to dynamically set up field-names,
> e.g., a
> > > >>> > > > > node that operates on the input's columns whose name starts
> > > >>> > > > > with the given string-name or a node that operates on an
> input
> > > >>> > > > > column whose name is given as data in another input column.
> > > >>> > > >
> > > >>> > > > In both of those cases the field names are not part of the
> plan
> > > >>> itself.
> > > >>> > > >
> > > >>> > > > On Tue, Apr 19, 2022 at 9:16 AM Yaron Gvili <
> rt...@hotmail.com>
> > > >>> wrote:
> > > >>> > > > >
> > > >>> > > > > Hi Weston,
> > > >>> > > > >
> > > >>> > > > > Thanks for the quick response.
> > > >>> > > > > I think you might have forgotten the links for [1][2][3]
> > > >>> > > > > Sorry about the confusion; I use these not as references to
> > > >>> links but
> > > >>> > > as
> > > >>> > > > markers of points I make in the beginning that I elaborate on
> > > >>> later, in
> > > >>> > > the
> > > >>> > > > places where I reuse the markers.
> > > >>> > > > > Are you going from Substrait to an Arrow execution plan?
> > > >>> > > > > Yes.
> > > >>> > > > > For Substrait -> Arrow most of our execution nodes should
> take
> > > >>> in a
> > > >>> > > > FieldRef which can be a name but can also be index-based. So
> I
> > > >>> wouldn't
> > > >>> > > > expect Substrait's exclusive use of index-based references
> to be
> > > an
> > > >>> > > issue.
> > > >>> > > > > Yes, I agree this is a design feature of Substrait and
> that if
> > > >>> the
> > > >>> > > > execution node takes in a FieldRef then there won't be a
> problem
> > > >>> with
> > > >>> > > > Substrait use of an index-based reference for it. If any
> > > >>> > to-be-developed
> > > >>> > > > execution node must indeed use a FieldRef when taking in a
> > > >>> parameter
> > > >>> > that
> > > >>> > > > refers to columns, then this requirement should be
> documented for
> > > >>> > > > developers of execution nodes.
> > > >>> > > > >
> > > >>> > > > > However, the problem is there are natural cases in which an
> > > >>> execution
> > > >>> > > > node should or must take in a string-name; these suffer from
> the
> > > >>> above
> > > >>> > > > design feature due to the non-natural field names that it
> leads
> > > >>> to. One
> > > >>> > > use
> > > >>> > > > case is an execution node that is joining N input tables on a
> > > >>> > column-name
> > > >>> > > > that exists in all of them. The execution node would make
> itself
> > > >>> more
> > > >>> > > > convenient for the user by allowing specifying a string-name
> than
> > > >>> by
> > > >>> > > > specifying N FieldRef instances (each with the same name but
> on a
> > > >>> > > different
> > > >>> > > > input table) like the above requirement would force. Another
> use
> > > >>> case
> > > >>> > > for a
> > > >>> > > > string-name is when it is used within the execution node to
> > > >>> dynamically
> > > >>> > > set
> > > >>> > > > up field-names, e.g., a node that operates on the input's
> columns
> > > >>> whose
> > > >>> > > > name starts with the given string-name or a node that
> operates on
> > > >>> an
> > > >>> > > input
> > > >>> > > > column whose name is given as data in another input column.
> > > >>> > > > >
> > > >>> > > > > This is the main reason we think there should be
> appropriate
> > > >>> > (natural)
> > > >>> > > > field names defined for each relation/node.
> > > >>> > > > > Also keep in mind that Substrait isn't exactly a query
> language
> > > >>> for
> > > >>> > > > user's to be typing by hand.
> > > >>> > > > > Yes, I took that into consideration. Namely, the above
> > > discussion
> > > >>> > > refers
> > > >>> > > > to the scenario of a user writing an Ibis expression and the
> > > >>> remaining
> > > >>> > > > steps - Substrait compilation, serialization to protobuf,,
> > > >>> > > deserialization
> > > >>> > > > to an Arrow plan, and execution of the plan - are done
> > > >>> systematically.
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > Yaron.
> > > >>> > > > > ________________________________
> > > >>> > > > > From: Weston Pace <weston.p...@gmail.com>
> > > >>> > > > > Sent: Tuesday, April 19, 2022 1:01 PM
> > > >>> > > > > To: dev@arrow.apache.org <dev@arrow.apache.org>
> > > >>> > > > > Cc: Li Jin <ice.xell...@gmail.com>
> > > >>> > > > > Subject: Re: [C++] output field names in Arrow Substrait
> > > >>> > > > >
> > > >>> > > > > Hi Yaron, I think you might have forgotten the links for
> > > >>> [1][2][3] so
> > > >>> > > > > I'm not entirely sure of the context.  Are you going from
> > > >>> Substrait
> > > >>> > to
> > > >>> > > > > an Arrow execution plan?  Or are you going from an Arrow
> > > >>> execution
> > > >>> > > > > plan to Substrait?
> > > >>> > > > >
> > > >>> > > > > For Substrait -> Arrow most of our execution nodes should
> take
> > > >>> in a
> > > >>> > > > > FieldRef which can be a name but can also be index-based.
> So I
> > > >>> > > > > wouldn't expect Substrait's exclusive use of index-based
> > > >>> references
> > > >>> > to
> > > >>> > > > > be an issue.  Also keep in mind that Substrait isn't
> exactly a
> > > >>> query
> > > >>> > > > > language for user's to be typing by hand.  So, for
> example, if
> > > a
> > > >>> user
> > > >>> > > > > wants a join and is using Ibis they could type:
> > > >>> > > > >
> > > >>> > > > > table.semi_join(s, table.x == table.y)
> > > >>> > > > >
> > > >>> > > > > Ibis is then responsible for converting "x" and "y" to the
> > > >>> > appropriate
> > > >>> > > > > indices (and it should have all the information needed to
> do
> > > so).
> > > >>> > The
> > > >>> > > > > Substrait plan will refer to these nodes by index and the
> > > >>> > > > > corresponding Arrow execution plan will use integer-based
> > > >>> FieldRef.
> > > >>> > > > >
> > > >>> > > > > For Arrow -> Substrait then I agree that this could become
> a
> > > >>> problem.
> > > >>> > > > > Right now the Arrow -> Substrait path is mainly used for
> > > internal
> > > >>> > > > > testing with the hope that Substrait -> Arrow -> Substrait
> > > should
> > > >>> > > > > generally be possible (with zero loss of information).
> This
> > > >>> does not
> > > >>> > > > > mean that every Arrow plan will be convertible into
> Substrait.
> > > >>> That
> > > >>> > > > > is certainly a potential goal, and PRs to add that
> capability
> > > >>> would
> > > >>> > be
> > > >>> > > > > welcome, but I don't know if anyone working on the
> > > >>> Arrow/Substrait
> > > >>> > > > > integration has that goal in mind.  If that is your goal I
> > > might
> > > >>> be
> > > >>> > > > > curious to learn more about your use cases.
> > > >>> > > > >
> > > >>> > > > > On Tue, Apr 19, 2022 at 6:11 AM Yaron Gvili <
> rt...@hotmail.com
> > > >
> > > >>> > wrote:
> > > >>> > > > > >
> > > >>> > > > > > Hi,
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > We ran into an issue due to the fact that, for
> intermediate
> > > >>> > > relations,
> > > >>> > > > Substrait does not automatically compute output field names
> nor
> > > >>> allows
> > > >>> > > one
> > > >>> > > > to explicitly name output fields [1]. This leads to trouble
> when
> > > >>> one
> > > >>> > > needs
> > > >>> > > > to refer to these output fields by name [2]. We run into this
> > > >>> trouble
> > > >>> > > when
> > > >>> > > > deserializing in Arrow [3] (based on commit 8e13c2dd).
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > One use case where [1] occurs is:
> > > >>> > > > > >
> > > >>> > > > > > import ibis
> > > >>> > > > > >
> > > >>> > > > > > import pyarrow as pa
> > > >>> > > > > >
> > > >>> > > > > > loaded_table = ibis.local_table(...)
> > > >>> > > > > >
> > > >>> > > > > > cast_table = ibis.mutate(loaded_table,
> > > >>> > > > time=loaded_table.time.cast(pa.int64())
> > > >>> > > > > >
> > > >>> > > > > > Even if loaded_table has nicely named fields, like
> "time" and
> > > >>> > > "value1"
> > > >>> > > > and "value2", the resulting cast_table (when deserialized in
> > > >>> Arrow) has
> > > >>> > > > inconvenient field names like "FieldPath(1)" and
> "FieldPath(2)"
> > > >>> for all
> > > >>> > > but
> > > >>> > > > the "time" field. I believe that Substrait has all the
> > > information
> > > >>> to
> > > >>> > > > automatically compute the field name "value*" for
> cast_table, but
> > > >>> it
> > > >>> > > > doesn't do this. Moreover, the caller has to know the schema
> of
> > > >>> > > > loaded_table in order to set the output field names for
> > > >>> cast_table, and
> > > >>> > > > this is not convenient. When cast_table is an intermediate
> > > relation
> > > >>> > > (i.e.,
> > > >>> > > > not the root relation of the plan), Substrait also doesn't
> allow
> > > >>> the
> > > >>> > > caller
> > > >>> > > > to explicitly name the output fields, and there is no place
> for
> > > >>> these
> > > >>> > > field
> > > >>> > > > names in the Substrait protobuf.
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > One use case where [2] occurs is in a join-type operation
> > > over
> > > >>> > > > cast_table. The caller would normally like to use a field
> name
> > > >>> (and not
> > > >>> > > an
> > > >>> > > > index) to refer to the join-key. Even if the caller knows the
> > > field
> > > >>> > name
> > > >>> > > > for the join-key in loaded_table, many field names in
> cast_table
> > > >>> (when
> > > >>> > > > deserialized in Arrow) are different (and each includes an
> index
> > > >>> in it)
> > > >>> > > > than those of loaded_table.
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > The case of [3] occurs because Arrow deserializes
> > > >>> ExecNodeOption
> > > >>> > > > instances before it has Schema instances at hand. At this
> stage,
> > > >>> > without
> > > >>> > > > schemata, Arrow cannot compute field names that should be
> placed
> > > >>> in a
> > > >>> > > > ExecNodeOption that needs them, in particular
> ProjectNodeOptions.
> > > >>> > > > Currently, Arrow creates an expression vector for the
> > > >>> > ProjectNodeOptions
> > > >>> > > > instance but leaves the names vector empty, and later
> defaults
> > > each
> > > >>> > name
> > > >>> > > to
> > > >>> > > > the ToString of each expression.
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > We'd like your input on this issue. Currently, we see two
> > > >>> viable
> > > >>> > ways
> > > >>> > > > to resolve this issue: (1) add output field name for
> intermediate
> > > >>> > > relations
> > > >>> > > > in the Substrait plan so that Arrow can directly access
> them, or
> > > >>> (2)
> > > >>> > > > reproduce the field names in Arrow during deserialization by
> > > >>> creating
> > > >>> > > > Schema instances before ExecNodeOptions instances. In any
> case,
> > > we
> > > >>> > think
> > > >>> > > > there should be appropriate field names defined for each
> > > >>> relation/node.
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > Note that due to this issue, any Arrow execution node
> that
> > > >>> prompts
> > > >>> > > the
> > > >>> > > > user to refer to fields by name might not play nicely with
> > > >>> Substrait.
> > > >>> > > > > >
> > > >>> > > > > >
> > > >>> > > > > > Cheers,
> > > >>> > > > > > Yaron.
> > > >>> > > >
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > >
>

Re: [C++] output field names in Arrow Substrait

Reply via email to