I tend to agree with Dewey. Using run-end encoding to represent a scalar is clever and would keep the C data interface more compact. Also, a struct array is a superset of a record batch (assuming the metadata is kept in the schema). Consumers should always be able to deserialize into a struct array and then downcast to a record batch if that is what they want (raising an error if there happen to be nulls).
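For concreteness, here is a rough, untested sketch of both ideas in Arrow C++ (RunEndEncodedArray requires a reasonably recent version; BroadcastScalar and DowncastToBatch are just names I made up):

#include <arrow/api.h>

// A "scalar" broadcast to `length` rows as a run-end encoded array: one
// run covering the whole logical length, so nothing is materialized.
arrow::Result<std::shared_ptr<arrow::Array>> BroadcastScalar(
    const std::shared_ptr<arrow::Scalar>& scalar, int64_t length) {
  // run_ends = [length], values = [scalar]
  ARROW_ASSIGN_OR_RAISE(
      auto run_ends,
      arrow::MakeArrayFromScalar(
          arrow::Int32Scalar(static_cast<int32_t>(length)), 1));
  ARROW_ASSIGN_OR_RAISE(auto values, arrow::MakeArrayFromScalar(*scalar, 1));
  return arrow::RunEndEncodedArray::Make(length, run_ends, values);
}

// The struct-array-to-record-batch "downcast": Arrow C++ already exposes
// this, and it errors if the struct array has top-level nulls.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> DowncastToBatch(
    const std::shared_ptr<arrow::Array>& struct_array) {
  return arrow::RecordBatch::FromStructArray(struct_array);
}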
> Depending on the function in question, it could be valid to pass a
> struct column vs a record batch with different results.

Are there any concrete examples where this is the case? The closest example I can think of is something like the `drop_nulls` function, which, given a record batch, would drop rows where any column is null but, given an array, would drop only rows where the top-level struct is null. However, it might be clearer to just give the two functions different names anyway.
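To illustrate the two behaviors, here is a sketch hand-rolled against Arrow C++'s compute API (DropNullRows and DropNullElements are made-up names, not any library's actual drop_nulls; I believe Arrow's compute module also ships a drop_null kernel, but the point here is just the semantic difference):

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Record-batch semantics: drop a row when ANY column is null in that row.
// (Assumes the batch has at least one column.)
arrow::Result<std::shared_ptr<arrow::RecordBatch>> DropNullRows(
    const std::shared_ptr<arrow::RecordBatch>& batch) {
  arrow::Datum keep(true);  // scalar true; And() broadcasts it
  for (const auto& column : batch->columns()) {
    ARROW_ASSIGN_OR_RAISE(auto valid, cp::IsValid(column));
    ARROW_ASSIGN_OR_RAISE(keep, cp::And(keep, valid));
  }
  ARROW_ASSIGN_OR_RAISE(auto out, cp::Filter(batch, keep));
  return out.record_batch();
}

// Array semantics: drop only elements whose top-level struct slot is null;
// rows whose struct is valid survive even if their fields contain nulls.
arrow::Result<std::shared_ptr<arrow::Array>> DropNullElements(
    const std::shared_ptr<arrow::Array>& array) {
  ARROW_ASSIGN_OR_RAISE(auto valid, cp::IsValid(array));
  ARROW_ASSIGN_OR_RAISE(auto out, cp::Filter(array, valid));
  return out.make_array();
}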
On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:
> Thank you for the background!
>
> I still wonder if these distinctions are the responsibility of the
> ArrowSchema to communicate (although perhaps links to the specific
> discussions would help highlight use-cases that I am not envisioning).
> I think these distinctions are definitely important in the contexts
> you mentioned; however, I am not sure that the FFI layer is going to
> be helpful.
>
> > In the libcudf situation, it came up as: what happens if you pass a
> > non-struct column to the from_arrow_device method, which returns a
> > cudf::table? Should it error, or should it create a table with a
> > single column?
>
> I suppose that I would have expected two functions (one to create a
> table and one to create a column). As a consumer I can't envision a
> situation where I would want to import an ArrowDeviceArray but where I
> would want some piece of run-time information to decide what the
> return type of the function would be? (With apologies if I am missing
> a piece of the discussion.)
>
> > If A and B have different lengths, this is invalid
>
> I believe several array implementations (e.g., numpy, R) are able to
> broadcast/recycle a length-1 array. Run-end encoding is also an option
> that would make that broadcast explicit without expanding the scalar.
>
> > Depending on the function in question, it could be valid to pass a
> > struct column vs a record batch with different results.
>
> If this is an important distinction for an FFI signature of a UDF,
> there would probably be a struct definition for the UDF where there
> would be an opportunity to make this distinction (and perhaps others
> that are relevant) without loading this concept onto the existing
> structs.
>
> > If no flags are set, then the behavior shouldn't change from what it
> > is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it should
> > error unless calling ImportRecordBatch.
>
> I am not sure I would have expected that (since a struct array has an
> unambiguous interpretation as a record batch, and as a user I've very
> explicitly decided that I want one, since I'm using that function).
>
> In the other direction, I am not sure a producer would be able to set
> these flags without breaking backwards compatibility with earlier
> consumers that do not recognize them (since earlier threads have
> suggested that it is good practice to error when an unsupported flag
> is encountered).
>
> On Sun, Apr 21, 2024 at 6:16 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > First, I forgot a flag in my examples. There should also be an
> > ARROW_FLAG_SCALAR too!
> >
> > The motivation for this distinction came up from discussions during
> > adding support for ArrowDeviceArray to libcudf, in order to better
> > indicate the difference between a cudf::table and a cudf::column,
> > which are handled quite differently. This also relates to the fact
> > that we currently need external context like the explicit
> > ImportArray() and ImportRecordBatch() functions, since we can't
> > determine which a given ArrowArray is on its own. In the libcudf
> > situation, it came up as: what happens if you pass a non-struct
> > column to the from_arrow_device method, which returns a cudf::table?
> > Should it error, or should it create a table with a single column?
> >
> > The other motivation for this distinction is with UDFs in an engine
> > that uses the C data interface. When dealing with queries and
> > engines, it becomes important to be able to distinguish between a
> > record batch, a column, and a scalar. For example, take the
> > expression A + B:
> >
> > If A and B have different lengths, this is invalid... unless one of
> > them is a Scalar. This is because Scalars are broadcastable; columns
> > are not.
> >
> > Depending on the function in question, it could be valid to pass a
> > struct column vs a record batch with different results. It also
> > resolves some ambiguity for UDFs and processing. For instance, given
> > a single ArrowArray of length 1 which is a struct: Is that a struct
> > column? A record batch? Or is it a scalar? There's no way to know
> > what the producer's intention was, or the context, without these
> > flags or having to side-channel the information somehow.
> >
> > > It seems like it may cause some ambiguous situations...should
> > > C++'s ImportArray() error, for example, if the schema has an
> > > ARROW_FLAG_RECORD_BATCH flag?
> >
> > I would argue yes. If no flags are set, then the behavior shouldn't
> > change from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is
> > set, then it should error unless calling ImportRecordBatch. It
> > allows the producer to provide context as to the source and
> > intention of the structure of the data.
> >
> > --Matt
> >
> > On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
> > <de...@voltrondata.com.invalid> wrote:
> > >
> > > Thanks for bringing this up!
> > >
> > > Could you share the motivation where this distinction is important
> > > in the context of transfer across the C data interface? The
> > > "struct == record batch" concept has always made sense to me
> > > because in R, a data.frame can have a column that is also a
> > > data.frame and there is no distinction between the two. It seems
> > > like it may cause some ambiguous situations...should C++'s
> > > ImportArray() error, for example, if the schema has an
> > > ARROW_FLAG_RECORD_BATCH flag?
> > >
> > > Cheers,
> > >
> > > -dewey
> > >
> > > On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> > > >
> > > > Hey everyone,
> > > >
> > > > With some of the other developments surrounding libraries
> > > > adopting the Arrow C data interfaces, there's been a consistent
> > > > question about handling tables (record batches) vs columns vs
> > > > scalars.
> > > >
> > > > Right now, a record batch is sent through the C interface as a
> > > > struct column whose children are the individual columns of the
> > > > batch, and a scalar would be sent through as just an array of
> > > > length 1. Applications would have to create their own contextual
> > > > way of indicating whether the Array being passed should be
> > > > interpreted as just a single array/column or should be treated
> > > > as a full table/record batch.
> > > > Rather than introducing new members or otherwise complicating
> > > > the structs, I wanted to gauge how people felt about introducing
> > > > new flags for the ArrowSchema object.
> > > >
> > > > Right now, we only have 3 defined flags:
> > > >
> > > > ARROW_FLAG_DICTIONARY_ORDERED
> > > > ARROW_FLAG_NULLABLE
> > > > ARROW_FLAG_MAP_KEYS_SORTED
> > > >
> > > > The flags member of the struct is an int64, so we have another
> > > > 61 bits to play with! If no one has any strong objections, I
> > > > wanted to propose adding at least 2 new flags:
> > > >
> > > > ARROW_FLAG_RECORD_BATCH
> > > > ARROW_FLAG_SINGLE_COLUMN
> > > >
> > > > If neither flag is set, then it is contextual as to whether the
> > > > corresponding data should be expected to be a table or a single
> > > > column. If ARROW_FLAG_RECORD_BATCH is set, then the
> > > > corresponding data MUST be a struct array and should be
> > > > interpreted as a record batch by any consumers (erroring
> > > > otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then the
> > > > corresponding ArrowArray should be interpreted and utilized as a
> > > > single array/column regardless of its type.
> > > >
> > > > This provides a standardized way for producers of Arrow data to
> > > > indicate in the schema to consumers how the data they produced
> > > > should be used (as a table or a column), rather than forcing
> > > > everyone to come up with their own contextualized way of
> > > > handling things (extra arguments, differently named functions
> > > > for RecordBatch / Array, etc.).
> > > >
> > > > If there are no objections to this, I'll take a pass at
> > > > implementing these flags in C++ and Go, put up a PR, and start a
> > > > vote thread. I just wanted to see what others on the mailing
> > > > list thought before I go ahead and put effort into this.
> > > >
> > > > Thanks everyone! Take care!
> > > >
> > > > --Matt
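P.S. For anyone skimming the thread, a minimal sketch of what the proposal might look like next to the existing flags (the new constants and their bit values are illustrative only; nothing here is part of the spec):

#include <cstdint>

// The three flags defined today in the Arrow C data interface:
//   ARROW_FLAG_DICTIONARY_ORDERED = 1
//   ARROW_FLAG_NULLABLE = 2
//   ARROW_FLAG_MAP_KEYS_SORTED = 4

// Proposed additions; bit values are placeholders I picked.
constexpr int64_t ARROW_FLAG_RECORD_BATCH = 8;
constexpr int64_t ARROW_FLAG_SINGLE_COLUMN = 16;
constexpr int64_t ARROW_FLAG_SCALAR = 32;

// Under the proposal, an importer could validate the producer's intent:
bool MustBeRecordBatch(int64_t schema_flags) {
  return (schema_flags & ARROW_FLAG_RECORD_BATCH) != 0;
}
bool MustBeSingleColumn(int64_t schema_flags) {
  return (schema_flags & ARROW_FLAG_SINGLE_COLUMN) != 0;
}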