In the absence of a general solution to the problem of the C data interface omitting buffer sizes, I think the original proposal is the best way forward. This is the first type to be added whose buffer sizes cannot be calculated without looping over every element of the array, and a consumer that cares about buffer sizes needs them to efficiently serialize the imported array to IPC.
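
To make that cost concrete, here is a rough sketch (not taken from any Arrow implementation) of what a consumer would have to do to recover the data buffer sizes of a Utf8View array. It assumes the 16-byte view layout in which strings of 12 bytes or fewer are stored inline. `StringView` and `min_data_buffer_sizes` are hypothetical names, and the validity bitmap is ignored for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed 16-byte view layout for Utf8View/BinaryView:
 *   length <= 12: the bytes are stored inline after the length;
 *   length  > 12: 4-byte prefix, then data buffer index and offset. */
#define VIEW_INLINE_MAX 12

typedef struct {
  int32_t length;
  union {
    uint8_t inlined[12];
    struct {
      uint8_t prefix[4];
      int32_t buffer_index;
      int32_t offset;
    } ref;
  } u;
} StringView;

/* Hypothetical helper: compute the minimum number of bytes each data
 * buffer must contain for every view to be in bounds. Without sizes
 * from the producer, this is O(n_views): every element is visited. */
static void min_data_buffer_sizes(const StringView* views, int64_t n_views,
                                  int64_t* sizes, int32_t n_data_buffers) {
  for (int32_t i = 0; i < n_data_buffers; i++) {
    sizes[i] = 0;
  }
  for (int64_t i = 0; i < n_views; i++) {
    const StringView* v = &views[i];
    if (v->length <= VIEW_INLINE_MAX) {
      continue; /* inline string: no data buffer is referenced */
    }
    int32_t idx = v->u.ref.buffer_index;
    /* A real importer would report an error instead of asserting. */
    assert(idx >= 0 && idx < n_data_buffers);
    int64_t end = (int64_t)v->u.ref.offset + v->length;
    if (end > sizes[idx]) {
      sizes[idx] = end;
    }
  }
}
```

Note that this only yields a lower bound on each buffer's size (the furthest byte any view refers to), which may be smaller than what the producer actually allocated.
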

Using a schema's flags to indicate something about a specific paired array (particularly one that, if misinterpreted, would lead to a crash) is a precedent that is probably not worth introducing for just one type. Currently a schema is completely independent of any particular ArrowArray, and I think that is a feature that is worth preserving. My gripes about not having buffer sizes on the CPU to more efficiently copy between devices are almost certainly better addressed in the ArrowDeviceArray struct.

On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman <bengil...@gmail.com> wrote:
>
> > This begs the question of what happens if a consumer receives an unknown
> > flag value.
>
> It seems to me that ignoring unknown flags is the primary case to consider at
> this point, since consumers may ignore unknown flags. Since that is the case,
> it seems adding any flag which would break such a consumer would be
> tantamount to an ABI breakage. I don't think this can be averted unless all
> consumers are required to error out on unknown flag values.
>
> In the specific case of Utf8View it seems certain that consumers would add
> support for the buffer sizes flag simultaneously with adding support for the
> new type (since Utf8View is difficult to import otherwise), so any consumer
> which would error out on the new flag would already be erroring out on an
> unsupported data type.
>
> > I might be the only person who has implemented
> > a deep copy of an ArrowSchema in C, but it does blindly pass along a
> > schema's flag value
>
> I think passing a schema's flag value including unknown flags is an error.
> The ABI defines moving structures but does not define deep copying. I think
> in order to copy deeply in terms of operations which *are* specified: we
> import then export the schema. Since this includes an export step, it
> should not include flags which are not supported by the exporter.
>
> On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > On 26/10/2023 at 20:02, Benjamin Kietzman wrote:
> > >> Is this buffer lengths buffer only present if the array type is
> > >> Utf8View?
> > >
> > > IIUC, the proposal would add the buffer lengths buffer for all types if
> > > the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it
> > > appealing to avoid the special case and that `n_buffers` would continue
> > > to be consistent with IPC.
> >
> > This begs the question of what happens if a consumer receives an unknown
> > flag value. We haven't specified that unknown flag values should be
> > ignored, so a consumer could judiciously choose to error out instead of
> > potentially misinterpreting the data.
> >
> > All in all, personally I'd rather we make a special case for Utf8View
> > instead of adding a flag that can lead to worse interoperability.
> >
> > Regards
> >
> > Antoine.
> >