Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
> This begs the question of what happens if a consumer receives an unknown flag 
> value

That's a great point...I might be the only person who has implemented
a deep copy of an ArrowSchema in C, but it does blindly pass along a
schema's flag value (which in the scenario I proposed could lead to a
consumer accessing a pointer that didn't exist).

I do think there is utility in considering buffer sizes more
generically in the future...if it is apparently so essential that
every Arrow implementation implements them in this way, it seems like
an oversight to have producers constantly omitting buffer sizes and
consumers constantly recalculating them.

On Thu, Oct 26, 2023 at 4:35 PM Dewey Dunnington  wrote:
>
> I'm afraid I've derailed the discussion into solving a bigger problem
> than strictly necessary. I don't think this is the time to solve the
> general problem of the C data interface having no way to communicate
> buffer sizes, particularly since there's no immediate agreement on its
> utility or implementation, but perhaps it is possible to solve it in a
> way that does not preclude implementing it in some generic way in the
> future.
>
> I think Ben's initial proposal of incrementing n_buffers by one and
> appending an int64_t* pointing to the buffer sizes accomplishes that,
> so consider me a +1. It might perhaps be more general if it included
> all buffer sizes (not just variadic ones), but given that it would
> only be useful for a few other types I don't think that is a game
> changer.
>
> It is probably also worth noting whether we expect the buffer
> containing the sizes to live on the CPU device always or whether we
> want it to live on the same device as the data buffers.
>
> On Thu, Oct 26, 2023 at 4:34 PM Antoine Pitrou  wrote:
> >
> >
> > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> > >> Is this buffer lengths buffer only present if the array type is Utf8View?
> > >
> > > IIUC, the proposal would add the buffer lengths buffer for all types if 
> > > the
> > > schema's
> > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
> > > the special case and that `n_buffers` would continue to be consistent with
> > > IPC.
> >
> > This begs the question of what happens if a consumer receives an unknown
> > flag value. We haven't specified that unknown flag values should be
> > ignored, so a consumer could judiciously choose to error out instead of
> > potentially misinterpreting the data.
> >
> > All in all, personally I'd rather we make a special case for Utf8View
> > instead of adding a flag that can lead to worse interoperability.
> >
> > Regards
> >
> > Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
I'm afraid I've derailed the discussion into solving a bigger problem
than strictly necessary. I don't think this is the time to solve the
general problem of the C data interface having no way to communicate
buffer sizes, particularly since there's no immediate agreement on its
utility or implementation, but perhaps it is possible to solve it in a
way that does not preclude implementing it in some generic way in the
future.

I think Ben's initial proposal of incrementing n_buffers by one and
appending an int64_t* pointing to the buffer sizes accomplishes that,
so consider me a +1. It might perhaps be more general if it included
all buffer sizes (not just variadic ones), but given that it would
only be useful for a few other types I don't think that is a game
changer.

It is probably also worth noting whether we expect the buffer
containing the sizes to live on the CPU device always or whether we
want it to live on the same device as the data buffers.

On Thu, Oct 26, 2023 at 4:34 PM Antoine Pitrou  wrote:
>
>
> Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :
> >> Is this buffer lengths buffer only present if the array type is Utf8View?
> >
> > IIUC, the proposal would add the buffer lengths buffer for all types if the
> > schema's
> > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
> > the special case and that `n_buffers` would continue to be consistent with
> > IPC.
>
> This begs the question of what happens if a consumer receives an unknown
> flag value. We haven't specified that unknown flag values should be
> ignored, so a consumer could judiciously choose to error out instead of
> potentially misinterpreting the data.
>
> All in all, personally I'd rather we make a special case for Utf8View
> instead of adding a flag that can lead to worse interoperability.
>
> Regards
>
> Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit :

Is this buffer lengths buffer only present if the array type is Utf8View?


IIUC, the proposal would add the buffer lengths buffer for all types if the
schema's
flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
the special case and that `n_buffers` would continue to be consistent with
IPC.


This begs the question of what happens if a consumer receives an unknown 
flag value. We haven't specified that unknown flag values should be 
ignored, so a consumer could judiciously choose to error out instead of 
potentially misinterpreting the data.


All in all, personally I'd rather we make a special case for Utf8View 
instead of adding a flag that can lead to worse interoperability.


Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Benjamin Kietzman
> Is this buffer lengths buffer only present if the array type is Utf8View?

IIUC, the proposal would add the buffer lengths buffer for all types if the
schema's
flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid
the special case and that `n_buffers` would continue to be consistent with
IPC.

On Thu, Oct 26, 2023 at 1:35 PM Weston Pace  wrote:

> Is this buffer lengths buffer only present if the array type is Utf8View?
> Or are you suggesting that other types might want to adopt this as well?
>
> On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington
>  wrote:
>
> > > I expect C code to not be much longer then this :-)
> >
> > nanoarrow's buffer-length-calculation and validation concepts are
> > (perhaps inadvisably) intertwined...even with both it is not that much
> > code (perhaps I was remembering how much time it took me to figure out
> > which 35 lines to write :-))
> >
> > > That sounds a bit hackish to me.
> >
> > Including only *some* buffer sizes in array->buffers[array->n_buffers]
> > special-cased for only two types (or altering the number of buffers
> > required by the IPC format vs. the number of buffers required by the C
> > Data interface) seem equally hackish to me (not that I'm opposed to
> > either necessarily...the alternatives really are very bad).
> >
> > > How can you *not* care about buffer sizes, if you for example need to
> > send the buffers over IPC?
> >
> > I think IPC is the *only* operation that requires that information?
> > (Other than perhaps copying to another device?) I don't think there's
> > any barrier to accessing the content of all the array elements but I
> > could be mistaken.
> >
> > On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou 
> wrote:
> > >
> > >
> > > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> > > > The lack of buffer sizes is something that has come up for me a few
> > > > times working with nanoarrow (which dedicates a significant amount of
> > > > code to calculating buffer sizes, which it uses to do validation and
> > > > more efficient copying).
> > >
> > > By the way, this is a bit surprising since it's really 35 lines of code
> > > in C++ currently:
> > >
> > >
> >
> https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754
> > >
> > > I expect C code to not be much longer then this :-)
> > >
> > > Regards
> > >
> > > Antoine.
> >
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Weston Pace
Is this buffer lengths buffer only present if the array type is Utf8View?
Or are you suggesting that other types might want to adopt this as well?

On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington
 wrote:

> > I expect C code to not be much longer then this :-)
>
> nanoarrow's buffer-length-calculation and validation concepts are
> (perhaps inadvisably) intertwined...even with both it is not that much
> code (perhaps I was remembering how much time it took me to figure out
> which 35 lines to write :-))
>
> > That sounds a bit hackish to me.
>
> Including only *some* buffer sizes in array->buffers[array->n_buffers]
> special-cased for only two types (or altering the number of buffers
> required by the IPC format vs. the number of buffers required by the C
> Data interface) seem equally hackish to me (not that I'm opposed to
> either necessarily...the alternatives really are very bad).
>
> > How can you *not* care about buffer sizes, if you for example need to
> send the buffers over IPC?
>
> I think IPC is the *only* operation that requires that information?
> (Other than perhaps copying to another device?) I don't think there's
> any barrier to accessing the content of all the array elements but I
> could be mistaken.
>
> On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou  wrote:
> >
> >
> > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> > > The lack of buffer sizes is something that has come up for me a few
> > > times working with nanoarrow (which dedicates a significant amount of
> > > code to calculating buffer sizes, which it uses to do validation and
> > > more efficient copying).
> >
> > By the way, this is a bit surprising since it's really 35 lines of code
> > in C++ currently:
> >
> >
> https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754
> >
> > I expect C code to not be much longer then this :-)
> >
> > Regards
> >
> > Antoine.
>


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 18:59, Dewey Dunnington a écrit :



That sounds a bit hackish to me.


Including only *some* buffer sizes in array->buffers[array->n_buffers]
special-cased for only two types (or altering the number of buffers
required by the IPC format vs. the number of buffers required by the C
Data interface) seem equally hackish to me (not that I'm opposed to
either necessarily...the alternatives really are very bad).


I think the plan for Utf8View is that `n_buffers` is incremented to 
reflect that additional buffer at the end.



How can you *not* care about buffer sizes, if you for example need to send the 
buffers over IPC?


I think IPC is the *only* operation that requires that information?
(Other than perhaps copying to another device?) I don't think there's
any barrier to accessing the content of all the array elements but I
could be mistaken.


That's true, but IPC is implemented by all major Arrow implementations, 
AFAIK :-)


Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
> I expect C code to not be much longer then this :-)

nanoarrow's buffer-length-calculation and validation concepts are
(perhaps inadvisably) intertwined...even with both it is not that much
code (perhaps I was remembering how much time it took me to figure out
which 35 lines to write :-))

> That sounds a bit hackish to me.

Including only *some* buffer sizes in array->buffers[array->n_buffers]
special-cased for only two types (or altering the number of buffers
required by the IPC format vs. the number of buffers required by the C
Data interface) seem equally hackish to me (not that I'm opposed to
either necessarily...the alternatives really are very bad).

> How can you *not* care about buffer sizes, if you for example need to send 
> the buffers over IPC?

I think IPC is the *only* operation that requires that information?
(Other than perhaps copying to another device?) I don't think there's
any barrier to accessing the content of all the array elements but I
could be mistaken.

On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou  wrote:
>
>
> Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> > The lack of buffer sizes is something that has come up for me a few
> > times working with nanoarrow (which dedicates a significant amount of
> > code to calculating buffer sizes, which it uses to do validation and
> > more efficient copying).
>
> By the way, this is a bit surprising since it's really 35 lines of code
> in C++ currently:
>
> https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754
>
> I expect C code to not be much longer then this :-)
>
> Regards
>
> Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
The lack of buffer sizes is something that has come up for me a few 
times working with nanoarrow (which dedicates a significant amount of 
code to calculating buffer sizes, which it uses to do validation and 
more efficient copying).


By the way, this is a bit surprising since it's really 35 lines of code 
in C++ currently:


https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754

I expect C code to not be much longer then this :-)

Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Antoine Pitrou



Le 26/10/2023 à 17:45, Dewey Dunnington a écrit :
> A potential alternative might be to allow any ArrowArray to declare
> its buffer sizes in array->buffers[array->n_buffers], perhaps with a
> new flag in schema->flags to advertise that capability.

That sounds a bit hackish to me.

I'd rather live with the current arrangement for now. Once enough griefs 
with the C Data Interface accumulate, it will be time to think about a 
new specification.



We might want to keep the variadic buffers at the end and instead export
the buffer sizes as buffer #2? Though that's mostly stylistic...


I would prefer the buffer sizes to be after as it preserves the
connection between Columnar/IPC format and the C Data interface...the
need for buffer_sizes is more of a convenience for implementations
that care about this kind of thing than something inherent to the
array data.


How can you *not* care about buffer sizes, if you for example need to 
send the buffers over IPC?


Regards

Antoine.


Re: [DISCUSS][Format] C data interface for Utf8View

2023-10-26 Thread Dewey Dunnington
Ben kindly explained to me offline that the need for the buffer sizes
is because when Arrow C++ imports an Array it creates Buffer class
wrappers around the imported pointers. Arrow C++ does not have a
notion of a buffer of unknown size to my knowledge, which leaves two
undesirable alternatives: (1) loop over every string to calculate the
maximum referenced buffer size for each buffer or (2) overhaul the
Buffer class to allow unknown buffer sizes and suffer the
corresponding performance/support issues when doing something with the
array data that would otherwise be type-agnostic (e.g., converting to
IPC).

The lack of buffer sizes is something that has come up for me a few
times working with nanoarrow (which dedicates a significant amount of
code to calculating buffer sizes, which it uses to do validation and
more efficient copying). The most recent issue I have had was when
implementing the Arrow C Device Interface: for string and binary (+
the large counterparts) it is necessary to access the buffers to
calculate the sizes, which makes it difficult to write
generic/performant code copying an entire array between devices.

A potential alternative might be to allow any ArrowArray to declare
its buffer sizes in array->buffers[array->n_buffers], perhaps with a
new flag in schema->flags to advertise that capability. I'm happy to
defer that discussion to another time but if there is no opposition,
it might be cleaner to include sooner than later (because it does not
involve special-casing specific types).

> We might want to keep the variadic buffers at the end and instead export
> the buffer sizes as buffer #2? Though that's mostly stylistic...

I would prefer the buffer sizes to be after as it preserves the
connection between Columnar/IPC format and the C Data interface...the
need for buffer_sizes is more of a convenience for implementations
that care about this kind of thing than something inherent to the
array data.

Cheers!

-dewey

On Wed, Oct 25, 2023 at 1:47 PM Antoine Pitrou  wrote:
>
>
> Hello,
>
> We might want to keep the variadic buffers at the end and instead export
> the buffer sizes as buffer #2? Though that's mostly stylistic...
>
> Regards
>
> Antoine.
>
>
> Le 25/10/2023 à 18:36, Benjamin Kietzman a écrit :
> > Hello all,
> >
> > The C ABI does not store buffer lengths explicitly, which presents a
> > problem for Utf8View since buffer lengths are not trivially extractable
> > from other data in the array. A potential solution is to store the lengths
> > in an extra buffer after the variadic data buffers. I've adopted this
> > approach in my (currently draft) PR [1] to add c++ library import/export
> > for Utf8VIew, but I thought this warranted raising on the ML in case anyone
> > has a better idea.
> >
> > Sincerely,
> > Ben Kietzman
> >
> > [1]
> > https://github.com/bkietz/arrow/compare/37710-cxx-impl-string-view..36099-string-view-c-abi#diff-3907fc8e8c9fa4ed7268f6baa5b919e8677fb99947b7384a9f8f001174ab66eaR549
> >


Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-26 Thread Benjamin Kietzman
Congratulations!

On Thu, Oct 26, 2023 at 10:35 AM Dane Pitkin 
wrote:

> Congratulations, Xuwei!
>
> On Thu, Oct 26, 2023 at 9:34 AM Joris Van den Bossche <
> jorisvandenboss...@gmail.com> wrote:
>
> > Congrats!
> >
> > On Wed, 25 Oct 2023 at 08:23, Ian Joiner  wrote:
> > >
> > > Congrats!
> > >
> > > On Mon, Oct 23, 2023 at 2:33 AM Sutou Kouhei 
> wrote:
> > >
> > > > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> > > > has accepted an invitation to become a committer on Apache
> > > > Arrow. Welcome, and thank you for your contributions!
> > > >
> > > > --
> > > > kou
> > > >
> >
>


Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-26 Thread Dane Pitkin
Congratulations, Xuwei!

On Thu, Oct 26, 2023 at 9:34 AM Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Congrats!
>
> On Wed, 25 Oct 2023 at 08:23, Ian Joiner  wrote:
> >
> > Congrats!
> >
> > On Mon, Oct 23, 2023 at 2:33 AM Sutou Kouhei  wrote:
> >
> > > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> > > has accepted an invitation to become a committer on Apache
> > > Arrow. Welcome, and thank you for your contributions!
> > >
> > > --
> > > kou
> > >
>


Re: [ANNOUNCE] New Arrow committer: Xuwei Fu

2023-10-26 Thread Joris Van den Bossche
Congrats!

On Wed, 25 Oct 2023 at 08:23, Ian Joiner  wrote:
>
> Congrats!
>
> On Mon, Oct 23, 2023 at 2:33 AM Sutou Kouhei  wrote:
>
> > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu
> > has accepted an invitation to become a committer on Apache
> > Arrow. Welcome, and thank you for your contributions!
> >
> > --
> > kou
> >