I tend to agree with Dewey. Using run-end encoding to represent a scalar is clever and would keep the C data interface more compact. Also, a struct array is a superset of a record batch (assuming the metadata is kept in the schema). Consumers should always be able to deserialize into a struct array and then downcast to a record batch if that is what they want (raising an error if there happen to be nulls).
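For concreteness, here is a rough, untested sketch of both ideas in Arrow C++ (RunEndEncodedArray requires a reasonably recent version; BroadcastScalar and DowncastToBatch are just names I made up):

#include <arrow/api.h>

// A "scalar" broadcast to `length` rows as a run-end encoded array: one
// run covering the whole logical length, so nothing is materialized.
arrow::Result<std::shared_ptr<arrow::Array>> BroadcastScalar(
    const std::shared_ptr<arrow::Scalar>& scalar, int64_t length) {
  // run_ends = [length], values = [scalar]
  ARROW_ASSIGN_OR_RAISE(
      auto run_ends,
      arrow::MakeArrayFromScalar(
          arrow::Int32Scalar(static_cast<int32_t>(length)), 1));
  ARROW_ASSIGN_OR_RAISE(auto values, arrow::MakeArrayFromScalar(*scalar, 1));
  return arrow::RunEndEncodedArray::Make(length, run_ends, values);
}

// The struct-array-to-record-batch "downcast": Arrow C++ already exposes
// this, and it errors if the struct array has top-level nulls.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> DowncastToBatch(
    const std::shared_ptr<arrow::Array>& struct_array) {
  return arrow::RecordBatch::FromStructArray(struct_array);
}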
> Depending on the function in question, it could be valid to pass a
> struct column vs a record batch with different results.

Are there any concrete examples where this is the case? The closest example I can think of is something like the `drop_nulls` function, which, given a record batch, would drop rows where any column is null but, given an array, would drop only rows where the top-level struct is null. However, it might be clearer to just give the two functions different names anyway.
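To illustrate the two behaviors, here is a sketch hand-rolled against Arrow C++'s compute API (DropNullRows and DropNullElements are made-up names, not any library's actual drop_nulls; I believe Arrow's compute module also ships a drop_null kernel, but the point here is just the semantic difference):

#include <arrow/api.h>
#include <arrow/compute/api.h>

namespace cp = arrow::compute;

// Record-batch semantics: drop a row when ANY column is null in that row.
// (Assumes the batch has at least one column.)
arrow::Result<std::shared_ptr<arrow::RecordBatch>> DropNullRows(
    const std::shared_ptr<arrow::RecordBatch>& batch) {
  arrow::Datum keep(true);  // scalar true; And() broadcasts it
  for (const auto& column : batch->columns()) {
    ARROW_ASSIGN_OR_RAISE(auto valid, cp::IsValid(column));
    ARROW_ASSIGN_OR_RAISE(keep, cp::And(keep, valid));
  }
  ARROW_ASSIGN_OR_RAISE(auto out, cp::Filter(batch, keep));
  return out.record_batch();
}

// Array semantics: drop only elements whose top-level struct slot is null;
// rows whose struct is valid survive even if their fields contain nulls.
arrow::Result<std::shared_ptr<arrow::Array>> DropNullElements(
    const std::shared_ptr<arrow::Array>& array) {
  ARROW_ASSIGN_OR_RAISE(auto valid, cp::IsValid(array));
  ARROW_ASSIGN_OR_RAISE(auto out, cp::Filter(array, valid));
  return out.make_array();
}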
On Mon, Apr 22, 2024 at 1:01 PM Dewey Dunnington
<de...@voltrondata.com.invalid> wrote:
> Thank you for the background!
>
> I still wonder if these distinctions are the responsibility of the
> ArrowSchema to communicate (although perhaps links to the specific
> discussions would help highlight use-cases that I am not envisioning).
> I think these distinctions are definitely important in the contexts
> you mentioned; however, I am not sure that the FFI layer is going to
> be helpful.
>
> > In the libcudf situation, it came up as: what happens if you pass a
> > non-struct column to the from_arrow_device method, which returns a
> > cudf::table? Should it error, or should it create a table with a
> > single column?
>
> I suppose that I would have expected two functions (one to create a
> table and one to create a column). As a consumer I can't envision a
> situation where I would want to import an ArrowDeviceArray but where I
> would want some piece of run-time information to decide what the
> return type of the function would be? (With apologies if I am missing
> a piece of the discussion.)
>
> > If A and B have different lengths, this is invalid
>
> I believe several array implementations (e.g., numpy, R) are able to
> broadcast/recycle a length-1 array. Run-end encoding is also an option
> that would make that broadcast explicit without expanding the scalar.
>
> > Depending on the function in question, it could be valid to pass a
> > struct column vs a record batch with different results.
>
> If this is an important distinction for an FFI signature of a UDF,
> there would probably be a struct definition for the UDF where there
> would be an opportunity to make this distinction (and perhaps others
> that are relevant) without loading this concept onto the existing
> structs.
>
> > If no flags are set, then the behavior shouldn't change from what it
> > is now. If the ARROW_FLAG_RECORD_BATCH flag is set, then it should
> > error unless calling ImportRecordBatch.
>
> I am not sure I would have expected that (since a struct array has an
> unambiguous interpretation as a record batch, and as a user I've very
> explicitly decided that I want one, since I'm using that function).
>
> In the other direction, I am not sure a producer would be able to set
> these flags without breaking backwards compatibility with earlier
> consumers that do not recognize them (since earlier threads have
> suggested that it is good practice to error when an unsupported flag
> is encountered).
>
> On Sun, Apr 21, 2024 at 6:16 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> >
> > First, I forgot a flag in my examples. There should also be an
> > ARROW_FLAG_SCALAR too!
> >
> > The motivation for this distinction came up from discussions during
> > adding support for ArrowDeviceArray to libcudf, in order to better
> > indicate the difference between a cudf::table and a cudf::column,
> > which are handled quite differently. This also relates to the fact
> > that we currently need external context like the explicit
> > ImportArray() and ImportRecordBatch() functions, since we can't
> > determine which a given ArrowArray is on its own. In the libcudf
> > situation, it came up as: what happens if you pass a non-struct
> > column to the from_arrow_device method, which returns a cudf::table?
> > Should it error, or should it create a table with a single column?
> >
> > The other motivation for this distinction is with UDFs in an engine
> > that uses the C data interface. When dealing with queries and
> > engines, it becomes important to be able to distinguish between a
> > record batch, a column, and a scalar. For example, take the
> > expression A + B:
> >
> > If A and B have different lengths, this is invalid... unless one of
> > them is a Scalar. This is because Scalars are broadcastable; columns
> > are not.
> >
> > Depending on the function in question, it could be valid to pass a
> > struct column vs a record batch with different results. It also
> > resolves some ambiguity for UDFs and processing. For instance, given
> > a single ArrowArray of length 1 which is a struct: Is that a struct
> > column? A record batch? Or is it a scalar? There's no way to know
> > what the producer's intention was, or the context, without these
> > flags or having to side-channel the information somehow.
> >
> > > It seems like it may cause some ambiguous situations...should
> > > C++'s ImportArray() error, for example, if the schema has an
> > > ARROW_FLAG_RECORD_BATCH flag?
> >
> > I would argue yes. If no flags are set, then the behavior shouldn't
> > change from what it is now. If the ARROW_FLAG_RECORD_BATCH flag is
> > set, then it should error unless calling ImportRecordBatch. It
> > allows the producer to provide context as to the source and
> > intention of the structure of the data.
> >
> > --Matt
> >
> > On Fri, Apr 19, 2024 at 8:24 PM Dewey Dunnington
> > <de...@voltrondata.com.invalid> wrote:
> > >
> > > Thanks for bringing this up!
> > >
> > > Could you share the motivation where this distinction is important
> > > in the context of transfer across the C data interface? The
> > > "struct == record batch" concept has always made sense to me
> > > because in R, a data.frame can have a column that is also a
> > > data.frame and there is no distinction between the two. It seems
> > > like it may cause some ambiguous situations...should C++'s
> > > ImportArray() error, for example, if the schema has an
> > > ARROW_FLAG_RECORD_BATCH flag?
> > >
> > > Cheers,
> > >
> > > -dewey
> > >
> > > On Fri, Apr 19, 2024 at 6:34 PM Matt Topol <zotthewiz...@gmail.com> wrote:
> > > >
> > > > Hey everyone,
> > > >
> > > > With some of the other developments surrounding libraries
> > > > adopting the Arrow C data interfaces, there's been a consistent
> > > > question about handling tables (record batches) vs columns vs
> > > > scalars.
> > > >
> > > > Right now, a record batch is sent through the C interface as a
> > > > struct column whose children are the individual columns of the
> > > > batch, and a scalar would be sent through as just an array of
> > > > length 1. Applications would have to create their own contextual
> > > > way of indicating whether the Array being passed should be
> > > > interpreted as just a single array/column or should be treated
> > > > as a full table/record batch.
> > > > Rather than introducing new members or otherwise complicating
> > > > the structs, I wanted to gauge how people felt about introducing
> > > > new flags for the ArrowSchema object.
> > > >
> > > > Right now, we only have 3 defined flags:
> > > >
> > > > ARROW_FLAG_DICTIONARY_ORDERED
> > > > ARROW_FLAG_NULLABLE
> > > > ARROW_FLAG_MAP_KEYS_SORTED
> > > >
> > > > The flags member of the struct is an int64, so we have another
> > > > 61 bits to play with! If no one has any strong objections, I
> > > > wanted to propose adding at least 2 new flags:
> > > >
> > > > ARROW_FLAG_RECORD_BATCH
> > > > ARROW_FLAG_SINGLE_COLUMN
> > > >
> > > > If neither flag is set, then it is contextual as to whether the
> > > > corresponding data should be expected to be a table or a single
> > > > column. If ARROW_FLAG_RECORD_BATCH is set, then the
> > > > corresponding data MUST be a struct array and should be
> > > > interpreted as a record batch by any consumers (erroring
> > > > otherwise). If ARROW_FLAG_SINGLE_COLUMN is set, then the
> > > > corresponding ArrowArray should be interpreted and utilized as a
> > > > single array/column regardless of its type.
> > > >
> > > > This provides a standardized way for producers of Arrow data to
> > > > indicate in the schema to consumers how the data they produced
> > > > should be used (as a table or a column), rather than forcing
> > > > everyone to come up with their own contextualized way of
> > > > handling things (extra arguments, differently named functions
> > > > for RecordBatch / Array, etc.).
> > > >
> > > > If there are no objections to this, I'll take a pass at
> > > > implementing these flags in C++ and Go, put up a PR, and start a
> > > > vote thread. I just wanted to see what others on the mailing
> > > > list thought before I go ahead and put effort into this.
> > > >
> > > > Thanks everyone! Take care!
> > > >
> > > > --Matt
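P.S. For anyone skimming the thread, a minimal sketch of what the proposal might look like next to the existing flags (the new constants and their bit values are illustrative only; nothing here is part of the spec):

#include <cstdint>

// The three flags defined today in the Arrow C data interface:
//   ARROW_FLAG_DICTIONARY_ORDERED = 1
//   ARROW_FLAG_NULLABLE = 2
//   ARROW_FLAG_MAP_KEYS_SORTED = 4

// Proposed additions; bit values are placeholders I picked.
constexpr int64_t ARROW_FLAG_RECORD_BATCH = 8;
constexpr int64_t ARROW_FLAG_SINGLE_COLUMN = 16;
constexpr int64_t ARROW_FLAG_SCALAR = 32;

// Under the proposal, an importer could validate the producer's intent:
bool MustBeRecordBatch(int64_t schema_flags) {
  return (schema_flags & ARROW_FLAG_RECORD_BATCH) != 0;
}
bool MustBeSingleColumn(int64_t schema_flags) {
  return (schema_flags & ARROW_FLAG_SINGLE_COLUMN) != 0;
}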