Re: [DISCUSS][Format] C data interface for Utf8View

Andrew Lamb Wed, 15 Nov 2023 16:46:05 -0800

Given the constraints of not changing the existing struct definitions,
adding a new buffer seems like the only way forward from what I understand.
It is unfortunate that each array now needs a new allocation (the
buffer lengths) when passing via FFI, but I don't have any other
suggestions.
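
To make the trade-off concrete, here is a rough sketch (not taken from any Arrow
implementation) of what a consumer has to do when the lengths are not supplied,
and of the small extra buffer a producer would allocate instead. It assumes the
16-byte view layout from the columnar format (int32 length, then either 12 inline
bytes or a 4-byte prefix, int32 buffer index and int32 offset), little-endian
data, and int64 lengths; the function names are made up for illustration.

```python
import struct

VIEW_SIZE = 16    # each view is 16 bytes
INLINE_MAX = 12   # strings of up to 12 bytes live inside the view itself


def make_view(data, buf_index=0, offset=0):
    """Build one view (hypothetical helper, only used for the example)."""
    if len(data) <= INLINE_MAX:
        return struct.pack("<i12s", len(data), data)  # 12s pads with zero bytes
    return struct.pack("<i4sii", len(data), data[:4], buf_index, offset)


def data_buffer_sizes(views, n_data_buffers):
    """Smallest size each data buffer can have, found by scanning every view."""
    sizes = [0] * n_data_buffers
    for pos in range(0, len(views), VIEW_SIZE):
        (length,) = struct.unpack_from("<i", views, pos)
        if length <= INLINE_MAX:
            continue  # inline string: references no data buffer
        _, _prefix, buf_index, offset = struct.unpack_from("<i4sii", views, pos)
        sizes[buf_index] = max(sizes[buf_index], offset + length)
    return sizes


def pack_lengths_buffer(sizes):
    """The extra allocation: one int64 per data buffer (element type assumed)."""
    return struct.pack(f"<{len(sizes)}q", *sizes)


buf0 = b"a string longer than twelve bytes"
buf1 = b"another long string value"
views = make_view(b"short") + make_view(buf0, 0, 0) + make_view(buf1, 1, 0)

sizes = data_buffer_sizes(views, 2)   # O(number of elements) scan
lengths = pack_lengths_buffer(sizes)  # 8 bytes per data buffer
```

Note that the scan only yields a lower bound (the furthest byte any view
references), while a producer can simply report the sizes it already knows.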


Andrew

On Tue, Nov 7, 2023 at 5:46 PM Weston Pace <[email protected]> wrote:

> +1 for the original proposal as well.
>
> ---
>
> The (minor) problem I see with flags is that there isn't much point to this
> feature if you are gating on a flag.  I'm assuming the goal is what Dewey
> originally mentioned which is making buffer calculations easier.  However,
> if you're gating the feature with a flag then you are either:
>
>  * Rejecting input from producers that don't support this feature
> (undesirable, better to align on one use model if we can)
>  * Doing all the work anyways to handle producers that don't support the
> feature
>
> Maybe it makes sense for a long term migration (e.g. we all agree this is
> something we want to move towards but we need to handle old producers in
> the meantime) but we can always discuss that separately and I don't think
> the benefit here is worth the confusion.
>
> On Tue, Nov 7, 2023 at 7:46 AM Will Jones <[email protected]> wrote:
>
> > I agree with the approach originally proposed by Ben. It seems like the
> > most straightforward way to implement within the current protocol.
> >
> > On Sun, Oct 29, 2023 at 4:59 PM Dewey Dunnington
> > <[email protected]> wrote:
> >
> > > In the absence of a general solution to the C data interface omitting
> > > buffer sizes, I think the original proposal is the best way
> > > forward...this is the first type to be added whose buffer sizes cannot
> > > be calculated without looping over every element of the array; the
> > > buffer sizes are needed to efficiently serialize the imported array to
> > > IPC if imported by a consumer that cares about buffer sizes.
> > >
> > > Using a schema's flags to indicate something about a specific paired
> > > array (particularly one that, if misinterpreted, would lead to a
> > > crash) is a precedent that is probably not worth introducing for just
> > > one type. Currently a schema is completely independent of any
> > > particular ArrowArray, and I think that is a feature that is worth
> > > > preserving. My gripes about not having buffer sizes on the CPU to more
> > > > efficiently copy between devices are almost certainly better
> > > > suited to the ArrowDeviceArray struct.
> > >
> > > > On Fri, Oct 27, 2023 at 12:45 PM Benjamin Kietzman <[email protected]> wrote:
> > > >
> > > > > > This begs the question of what happens if a consumer receives an
> > > > > > unknown flag value.
> > > >
> > > > > It seems to me that ignoring unknown flags is the primary case to
> > > > > consider at this point, since consumers may ignore unknown flags. Since
> > > > > that is the case, it seems adding any flag which would break such a
> > > > > consumer would be tantamount to an ABI breakage. I don't think this can
> > > > > be averted unless all consumers are required to error out on unknown
> > > > > flag values.
> > > >
> > > > > In the specific case of Utf8View it seems certain that consumers would
> > > > > add support for the buffer sizes flag simultaneously with adding support
> > > > > for the new type (since Utf8View is difficult to import otherwise), so
> > > > > any consumer which would error out on the new flag would already be
> > > > > erroring out on an unsupported data type.
> > > >
> > > > > > I might be the only person who has implemented
> > > > > > a deep copy of an ArrowSchema in C, but it does blindly pass along
> > > > > > a schema's flag value
> > > >
> > > > > I think passing a schema's flag value including unknown flags is an
> > > > > error. The ABI defines moving structures but does not define deep
> > > > > copying. I think in order to copy deeply in terms of operations which
> > > > > *are* specified: we import then export the schema. Since this includes
> > > > > an export step, it should not include flags which are not supported by
> > > > > the exporter.
> > > >
> > > > > On Thu, Oct 26, 2023 at 6:40 PM Antoine Pitrou <[email protected]> wrote:
> > > >
> > > > >
> > > > > > On 26/10/2023 at 20:02, Benjamin Kietzman wrote:
> > > > > > >> Is this buffer lengths buffer only present if the array type is
> > > > > > >> Utf8View?
> > > > > >
> > > > > > > IIUC, the proposal would add the buffer lengths buffer for all types
> > > > > > > if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it
> > > > > > > appealing to avoid the special case and that `n_buffers` would
> > > > > > > continue to be consistent with IPC.
> > > > >
> > > > > > This begs the question of what happens if a consumer receives an
> > > > > > unknown flag value. We haven't specified that unknown flag values
> > > > > > should be ignored, so a consumer could judiciously choose to error out
> > > > > > instead of potentially misinterpreting the data.
> > > > >
> > > > > > All in all, personally I'd rather we make a special case for Utf8View
> > > > > > instead of adding a flag that can lead to worse interoperability.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > >
> >
>
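
For reference, the "error out on unknown flag values" behaviour that Antoine
describes in the quoted thread could look roughly like the sketch below. The
three known flag values are the ones currently defined for ArrowSchema; the
value given to the proposed buffer lengths flag is purely hypothetical (the
thread never assigns one), and `check_schema_flags` is an illustrative name,
not an existing API.

```python
# Flag values currently defined for ArrowSchema.flags.
ARROW_FLAG_DICTIONARY_ORDERED = 1
ARROW_FLAG_NULLABLE = 2
ARROW_FLAG_MAP_KEYS_SORTED = 4

KNOWN_FLAGS = (
    ARROW_FLAG_DICTIONARY_ORDERED
    | ARROW_FLAG_NULLABLE
    | ARROW_FLAG_MAP_KEYS_SORTED
)

# Hypothetical value for the proposed flag, for illustration only.
HYPOTHETICAL_FLAG_BUFFER_LENGTHS = 8


def check_schema_flags(flags, known=KNOWN_FLAGS):
    """Strict consumer: refuse a schema carrying flag bits it does not know."""
    unknown = flags & ~known
    if unknown:
        raise ValueError(f"unsupported ArrowSchema flag bits: {unknown:#x}")
    return flags


def strip_unknown_flags(flags, known=KNOWN_FLAGS):
    """Re-export step: keep only the flags this exporter supports."""
    return flags & known
```

A strict consumer like this never misreads the buffers of an array produced
with a flag it does not understand, at the cost of rejecting such producers
outright. The second helper mirrors Ben's point that a copy done as
import-then-export should not carry unsupported flags along.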
