First, I forgot a flag in my examples: there should also be an ARROW_FLAG_SCALAR!
The motivation for this distinction came up during discussions about adding ArrowDeviceArray support to libcudf, where we needed a better way to indicate the difference between a cudf::table and a cudf::column, which are handled quite differently. It also relates to the fact that we currently need external context, like the explicit ImportArray() and ImportRecordBatch() functions, since we can't determine which one a given ArrowArray represents on its own. In the libcudf situation, the question was: what happens if you pass a non-struct column to the from_arrow_device method that returns a cudf::table? Should it error, or should it create a table with a single column?

The other motivation for this distinction is UDFs in an engine that uses the C data interface. When dealing with queries and engines, it becomes important to be able to distinguish between a record batch, a column, and a scalar. For example, take the expression A + B: if A and B have different lengths, this is invalid..... unless one of them is a Scalar. This is because Scalars are broadcastable and columns are not. Depending on the function in question, it could also be valid to pass either a struct column or a record batch, with different results.

It also resolves some ambiguity for UDFs and processing. For instance, given a single ArrowArray of length 1 whose type is a struct: is that a struct column? A record batch? Or is it a scalar? There's no way to know the producer's intention or the context without these flags, short of side-channeling the information somehow.

> It seems like it may cause some ambiguous situations...should C++'s
> ImportArray() error, for example, if the schema has a
> ARROW_FLAG_RECORD_BATCH flag?

I would argue yes. If no flags are set, then the behavior shouldn't change from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then ImportArray should error; only ImportRecordBatch should accept it. The flag allows the producer to provide context as to the source and intention of the structure of the data.
--Matt

On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:

> Thanks for bringing this up!
>
> Could you share the motivation where this distinction is important in
> the context of transfer across the C data interface? The "struct ==
> record batch" concept has always made sense to me because in R, a
> data.frame can have a column that is also a data.frame and there is no
> distinction between the two. It seems like it may cause some ambiguous
> situations...should C++'s ImportArray() error, for example, if the
> schema has a ARROW_FLAG_RECORD_BATCH flag?
>
> Cheers,
>
> -dewey
>
> On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > Hey everyone,
> >
> > With some of the other developments surrounding libraries adopting the
> > Arrow C Data interfaces, there's been a consistent question about
> > handling tables (record batch) vs columns vs scalars.
> >
> > Right now, a Record Batch is sent through the C interface as a struct
> > column whose children are the individual columns of the batch, and a
> > Scalar would be sent through as just an array of length 1. Applications
> > would have to create their own contextual way of indicating whether the
> > Array being passed should be interpreted as just a single array/column
> > or should be treated as a full table/record batch.
> >
> > Rather than introducing new members or otherwise complicating the
> > structs, I wanted to gauge how people felt about introducing new flags
> > for the ArrowSchema object.
> >
> > Right now, we only have 3 defined flags:
> >
> > ARROW_FLAG_DICTIONARY_ORDERED
> > ARROW_FLAG_NULLABLE
> > ARROW_FLAG_MAP_KEYS_SORTED
> >
> > The flags member of the struct is an int64, so we have another 61 bits
> > to play with!
> > If no one has any strong objections, I wanted to propose adding at
> > least 2 new flags:
> >
> > ARROW_FLAG_RECORD_BATCH
> > ARROW_FLAG_SINGLE_COLUMN
> >
> > If neither flag is set, then it is contextual as to whether it should
> > be expected that the corresponding data is a table or a single column.
> > If ARROW_FLAG_RECORD_BATCH is set, then the corresponding data MUST be
> > a struct array and should be interpreted as a record batch by any
> > consumers (erroring otherwise). If ARROW_FLAG_SINGLE_COLUMN is set,
> > then the corresponding ArrowArray should be interpreted and utilized
> > as a single array/column regardless of its type.
> >
> > This provides a standardized way for producers of Arrow data to
> > indicate in the schema to consumers how the data they produced should
> > be used (as a table or column) rather than forcing everyone to come up
> > with their own contextualized way of handling things (extra arguments,
> > differently named functions for RecordBatch / Array, etc.).
> >
> > If there's no objections to this, I'll take a pass at implementing
> > these flags in C++ and Go to put up a PR and make a Vote thread. I
> > just wanted to see what others on the mailing list thought before I go
> > ahead and put effort into this.
> >
> > Thanks everyone! Take care!
> >
> > --Matt
>