First, I forgot a flag in my examples: there should also be an
ARROW_FLAG_SCALAR!
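
For concreteness, here's a sketch of how the full set could sit alongside
the existing definitions in the C data interface header. The three existing
values are from the spec; the new bit positions below are purely
illustrative and not finalized:

    /* Existing flags, as defined by the Arrow C data interface spec: */
    #define ARROW_FLAG_DICTIONARY_ORDERED 1
    #define ARROW_FLAG_NULLABLE 2
    #define ARROW_FLAG_MAP_KEYS_SORTED 4

    /* Proposed additions -- bit values here are illustrative only: */
    #define ARROW_FLAG_RECORD_BATCH 8   /* struct array is a record batch */
    #define ARROW_FLAG_SINGLE_COLUMN 16 /* array is a standalone column   */
    #define ARROW_FLAG_SCALAR 32        /* length-1 array is a scalar     */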

The motivation for this distinction came out of discussions around adding
ArrowDeviceArray support to libcudf, where we needed a better way to
indicate the difference between a cudf::table and a cudf::column, which are
handled quite differently. It also relates to the fact that we currently
need external context, like the explicit ImportArray() and
ImportRecordBatch() functions, because we can't determine which one a given
ArrowArray represents on its own. In the libcudf case the question was:
what happens if you pass a non-struct column to the from_arrow_device
function that returns a cudf::table? Should it error, or should it create a
table with a single column?
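
As a hedged sketch of how the flag could answer that question: the decision
logic below is illustrative, not libcudf's actual code. The ArrowSchema
layout and the "+s" struct format string are from the C data interface
spec, and ARROW_FLAG_RECORD_BATCH uses the illustrative value above:

    #include <stdint.h>
    #include <string.h>

    /* struct ArrowSchema exactly as defined by the C data interface: */
    struct ArrowSchema {
      const char* format;
      const char* name;
      const char* metadata;
      int64_t flags;
      int64_t n_children;
      struct ArrowSchema** children;
      struct ArrowSchema* dictionary;
      void (*release)(struct ArrowSchema*);
      void* private_data;
    };

    #define ARROW_FLAG_RECORD_BATCH 8  /* illustrative value, as above */

    /* Illustrative: a table-importing function like from_arrow_device
     * could use the flag to decide instead of guessing. */
    int can_import_as_table(const struct ArrowSchema* schema) {
      if (strcmp(schema->format, "+s") != 0)
        return 0;   /* not a struct array, so never a record batch */
      if (schema->flags & ARROW_FLAG_RECORD_BATCH)
        return 1;   /* producer explicitly marked it a record batch */
      return -1;    /* unflagged: ambiguous, same situation as today */
    }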

The other motivation is UDFs in an engine that uses the C data interface.
When dealing with queries and engines, it becomes important to distinguish
between a record batch, a column, and a scalar. For example, take the
expression A + B:

If A and B have different lengths, the expression is invalid... unless one
of them is a scalar, because scalars are broadcastable and columns are not.
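
A minimal sketch of the length check an engine could then perform.
ARROW_FLAG_SCALAR and its bit value are part of this proposal, not the
current spec, and the lengths come from the corresponding ArrowArray
structs:

    #include <stdint.h>

    #define ARROW_FLAG_SCALAR 32  /* illustrative value, as above */

    /* For A + B: equal lengths are required unless one side is a
     * scalar, because scalars broadcast and columns do not. */
    int binary_op_lengths_ok(int64_t len_a, int64_t flags_a,
                             int64_t len_b, int64_t flags_b) {
      if ((flags_a & ARROW_FLAG_SCALAR) || (flags_b & ARROW_FLAG_SCALAR))
        return 1;             /* scalar side broadcasts to the other */
      return len_a == len_b;  /* two columns must match exactly */
    }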

Depending on the function in question, it could also be valid to pass
either a struct column or a record batch, with different results for each.
The flags likewise resolve ambiguity for UDFs and other processing. For
instance, given a single ArrowArray of length 1 whose type is a struct: is
that a struct column? A record batch? A scalar? Without these flags,
there's no way to know the producer's intention or the context short of
side-channeling that information somehow.
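
In other words, the flags let a consumer branch on the producer's stated
intent instead of guessing. A sketch, with the same illustrative flag
values as above:

    #include <stdint.h>

    #define ARROW_FLAG_RECORD_BATCH 8   /* illustrative values */
    #define ARROW_FLAG_SINGLE_COLUMN 16
    #define ARROW_FLAG_SCALAR 32

    enum ProducerIntent { INTENT_RECORD_BATCH, INTENT_COLUMN,
                          INTENT_SCALAR, INTENT_UNKNOWN };

    /* Resolve the "length-1 struct array" ambiguity from the flags. */
    enum ProducerIntent producer_intent(int64_t flags) {
      if (flags & ARROW_FLAG_RECORD_BATCH)  return INTENT_RECORD_BATCH;
      if (flags & ARROW_FLAG_SCALAR)        return INTENT_SCALAR;
      if (flags & ARROW_FLAG_SINGLE_COLUMN) return INTENT_COLUMN;
      return INTENT_UNKNOWN;  /* no flags: today's contextual behavior */
    }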

> It seems like it may cause some ambiguous
> situations...should C++'s ImportArray() error, for example, if the
> schema has a ARROW_FLAG_RECORD_BATCH flag?

I would argue yes. If no flags are set, the behavior shouldn't change from
what it is now. If ARROW_FLAG_RECORD_BATCH is set, ImportArray() should
error and only ImportRecordBatch() should accept the data. This lets the
producer provide context about the source and intended structure of the data.

--Matt

On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:

> Thanks for bringing this up!
>
> Could you share the motivation where this distinction is important in
> the context of transfer across the C data interface? The "struct ==
> record batch" concept has always made sense to me because in R, a
> data.frame can have a column that is also a data.frame and there is no
> distinction between the two. It seems like it may cause some ambiguous
> situations...should C++'s ImportArray() error, for example, if the
> schema has a ARROW_FLAG_RECORD_BATCH flag?
>
> Cheers,
>
> -dewey
>
> On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > Hey everyone,
> >
> > With some of the other developments surrounding libraries adopting the
> > Arrow C Data interfaces, there's been a consistent question about
> handling
> > tables (record batch) vs columns vs scalars.
> >
> > Right now, a Record Batch is sent through the C interface as a struct
> > column whose children are the individual columns of the batch and a
> Scalar
> > would be sent through as just an array of length 1. Applications would
> have
> > to create their own contextual way of indicating whether the Array being
> > passed should be interpreted as just a single array/column or should be
> > treated as a full table/record batch.
> >
> > Rather than introducing new members or otherwise complicating the
> structs,
> > I wanted to gauge how people felt about introducing new flags for the
> > ArrowSchema object.
> >
> > Right now, we only have 3 defined flags:
> >
> > ARROW_FLAG_DICTIONARY_ORDERED
> > ARROW_FLAG_NULLABLE
> > ARROW_FLAG_MAP_KEYS_SORTED
> >
> > The flags member of the struct is an int64, so we have another 61 bits to
> > play with! If no one has any strong objections, I wanted to propose
> adding
> > at least 2 new flags:
> >
> > ARROW_FLAG_RECORD_BATCH
> > ARROW_FLAG_SINGLE_COLUMN
> >
> > If neither flag is set, then it is contextual as to whether it should be
> > expected that the corresponding data is a table or a single column. If
> > ARROW_FLAG_RECORD_BATCH is set, then the corresponding data MUST be a
> > struct array and should be interpreted as a record batch by any consumers
> > (erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then the
> > corresponding ArrowArray should be interpreted and utilized as a single
> > array/column regardless of its type.
> >
> > This provides a standardized way for producers of Arrow data to indicate
> in
> > the schema to consumers how the data they produced should be used (as a
> > table or column) rather than forcing everyone to come up with their own
> > contextualized way of handling things (extra arguments, differently named
> > functions for RecordBatch / Array, etc.).
> >
> > If there's no objections to this, I'll take a pass at implementing these
> > flags in C++ and Go to put up a PR and make a Vote thread. I just wanted
> to
> > see what others on the mailing list thought before I go ahead and put
> > effort into this.
> >
> > Thanks everyone! Take care!
> >
> > --Matt
>
