Ben kindly explained to me offline that the buffer sizes are needed
because, when Arrow C++ imports an Array, it creates Buffer class
wrappers around the imported pointers. To my knowledge, Arrow C++ has
no notion of a buffer of unknown size, which leaves two undesirable
alternatives: (1) loop over every string to calculate the maximum
referenced size for each buffer or (2) overhaul the Buffer class to
allow unknown buffer sizes and suffer the corresponding
performance/support issues when doing something with the array data
that would otherwise be type-agnostic (e.g., converting to IPC).
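
To make alternative (1) concrete, here is a rough sketch of what that
loop could look like. It is not the Arrow C++ implementation: the
StringView type and the function name are made up for illustration,
and it assumes the 16-byte view layout from the columnar format
(strings of at most 12 bytes stored inline, longer ones stored as a
4-byte prefix plus a buffer index and offset).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative 16-byte view element (name is hypothetical). */
typedef union {
  struct { int32_t length; uint8_t data[12]; } inlined;
  struct {
    int32_t length;
    uint8_t prefix[4];
    int32_t buffer_index;
    int32_t offset;
  } ref;
} StringView;

/* Hypothetical helper: scan every view and record, for each variadic
   data buffer, the smallest size that covers all referenced bytes.
   validity may be NULL (no nulls). Returns 0 on success, -1 if a view
   references a buffer index outside [0, n_data_buffers). */
int min_variadic_buffer_sizes(const StringView* views,
                              const uint8_t* validity,
                              int64_t array_offset, int64_t length,
                              int64_t* sizes_out,
                              int32_t n_data_buffers) {
  assert(views != NULL && sizes_out != NULL);
  for (int32_t b = 0; b < n_data_buffers; b++) sizes_out[b] = 0;

  for (int64_t i = 0; i < length; i++) {
    int64_t idx = array_offset + i;
    /* Skip null slots (validity bits are least-significant-bit first). */
    if (validity != NULL && !((validity[idx / 8] >> (idx % 8)) & 1)) {
      continue;
    }
    const StringView* v = &views[idx];
    if (v->inlined.length <= 12) continue; /* stored inline */

    int32_t b = v->ref.buffer_index;
    if (b < 0 || b >= n_data_buffers) return -1;
    int64_t end = (int64_t)v->ref.offset + v->ref.length;
    if (end > sizes_out[b]) sizes_out[b] = end;
  }
  return 0;
}
```

The cost is a full pass over the views on every import, and the result
is only a lower bound on the real allocation size of each buffer.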

The lack of buffer sizes is something that has come up for me a few
times working with nanoarrow (which dedicates a significant amount of
code to calculating buffer sizes, which it uses to do validation and
more efficient copying). The most recent issue I had was when
implementing the Arrow C Device Interface: for string and binary (and
their large counterparts) it is necessary to access the buffer
contents to calculate the sizes, which makes it difficult to write
generic/performant code that copies an entire array between devices.
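
As a sketch of what that calculation looks like for a utf8 array with
32-bit offsets (this is not nanoarrow's actual code; the function name
is made up, and the struct is copied from the C Data Interface
definition only so the snippet is self-contained):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef ARROW_C_DATA_INTERFACE
#define ARROW_C_DATA_INTERFACE
/* Layout as given in the Arrow C Data Interface specification. */
struct ArrowArray {
  int64_t length;
  int64_t null_count;
  int64_t offset;
  int64_t n_buffers;
  int64_t n_children;
  const void** buffers;
  struct ArrowArray** children;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* private_data;
};
#endif

/* Hypothetical helper: minimum sizes in bytes of the validity, offsets,
   and data buffers of a utf8 array. */
void utf8_buffer_sizes(const struct ArrowArray* array, int64_t sizes[3]) {
  assert(array != NULL && array->n_buffers == 3);
  int64_t n = array->offset + array->length;

  /* Validity and offsets sizes follow from length and offset alone. */
  sizes[0] = array->buffers[0] != NULL ? (n + 7) / 8 : 0;

  const int32_t* offsets = (const int32_t*)array->buffers[1];
  if (offsets == NULL) { /* tolerate an empty array with no offsets */
    sizes[1] = 0;
    sizes[2] = 0;
    return;
  }
  sizes[1] = (n + 1) * (int64_t)sizeof(int32_t);

  /* The data size requires dereferencing the offsets buffer. If that
     pointer refers to device memory, this read is not valid on the CPU
     and the last offset has to be copied back first. */
  sizes[2] = offsets[n];
}
```

Only the last line needs the buffer contents, but that is enough to
force a type-specific path for what would otherwise be a plain copy.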

A potential alternative might be to allow any ArrowArray to declare
its buffer sizes in array->buffers[array->n_buffers], perhaps with a
new flag in schema->flags to advertise that capability. I'm happy to
defer that discussion to another time, but if there is no opposition,
it might be cleaner to include it sooner rather than later (because it
does not involve special-casing specific types).
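
Purely as an illustration of that idea (nothing here exists in the C
Data Interface today; the flag name and its value are invented for the
example, and int64 sizes are an assumption), a consumer could look
something like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL flag: not defined by the Arrow C Data Interface. */
#define EXAMPLE_FLAG_HAS_BUFFER_SIZES 8

/* Hypothetical helper: return the per-buffer sizes stored one slot past
   the regular buffers, or NULL if the producer did not advertise them.
   Assumes the producer allocated n_buffers + 1 pointers and that the
   extra slot points to n_buffers int64 values. */
const int64_t* example_buffer_sizes(const void** buffers,
                                    int64_t n_buffers,
                                    int64_t schema_flags) {
  assert(buffers != NULL && n_buffers >= 0);
  if (!(schema_flags & EXAMPLE_FLAG_HAS_BUFFER_SIZES)) return NULL;
  return (const int64_t*)buffers[n_buffers];
}
```

Consumers that do not know about the flag would keep reading only the
first n_buffers pointers, so the extra slot would be invisible to them.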

> We might want to keep the variadic buffers at the end and instead export
> the buffer sizes as buffer #2? Though that's mostly stylistic...

I would prefer the buffer sizes to come after the variadic buffers, as
that preserves the connection between the Columnar/IPC format and the
C Data interface. The need for buffer_sizes is more of a convenience
for implementations that care about this kind of thing than something
inherent to the array data.

Cheers!

-dewey

On Wed, Oct 25, 2023 at 1:47 PM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hello,
>
> We might want to keep the variadic buffers at the end and instead export
> the buffer sizes as buffer #2? Though that's mostly stylistic...
>
> Regards
>
> Antoine.
>
>
> Le 25/10/2023 à 18:36, Benjamin Kietzman a écrit :
> > Hello all,
> >
> > The C ABI does not store buffer lengths explicitly, which presents a
> > problem for Utf8View since buffer lengths are not trivially extractable
> > from other data in the array. A potential solution is to store the lengths
> > in an extra buffer after the variadic data buffers. I've adopted this
> > approach in my (currently draft) PR [1] to add C++ library import/export
> > for Utf8View, but I thought this warranted raising on the ML in case anyone
> > has a better idea.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]
> > https://github.com/bkietz/arrow/compare/37710-cxx-impl-string-view..36099-string-view-c-abi#diff-3907fc8e8c9fa4ed7268f6baa5b919e8677fb99947b7384a9f8f001174ab66eaR549
> >
