Re: [DISCUSS][Format] C data interface for Utf8View
> This begs the question of what happens if a consumer receives an unknown flag > value That's a great point...I might be the only person who has implemented a deep copy of an ArrowSchema in C, but it does blindly pass along a schema's flag value (which in the scenario I proposed could lead to a consumer accessing a pointer that didn't exist). I do think there is utility in considering buffer sizes more generically in the future...if it is apparently so essential that every Arrow implementation implements them in this way, it seems like an oversight to have producers constantly omitting buffer sizes and consumers constantly recalculating them. On Thu, Oct 26, 2023 at 4:35 PM Dewey Dunnington wrote: > > I'm afraid I've derailed the discussion into solving a bigger problem > than strictly necessary. I don't think this is the time to solve the > general problem of the C data interface having no way to communicate > buffer sizes, particularly since there's no immediate agreement on its > utility or implementation, but perhaps it is possible to solve it in a > way that does not preclude implementing it in some generic way in the > future. > > I think Ben's initial proposal of incrementing n_buffers by one and > appending an int64_t* pointing to the buffer sizes accomplishes that, > so consider me a +1. It might perhaps be more general if it included > all buffer sizes (not just variadic ones), but given that it would > only be useful for a few other types I don't think that is a game > changer. > > It is probably also worth noting whether we expect the buffer > containing the sizes to live on the CPU device always or whether we > want it to live on the same device as the data buffers. > > On Thu, Oct 26, 2023 at 4:34 PM Antoine Pitrou wrote: > > > > > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit : > > >> Is this buffer lengths buffer only present if the array type is Utf8View? > > > > > > IIUC, the proposal would add the buffer lengths buffer for all types if > > > the > > > schema's > > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid > > > the special case and that `n_buffers` would continue to be consistent with > > > IPC. > > > > This begs the question of what happens if a consumer receives an unknown > > flag value. We haven't specified that unknown flag values should be > > ignored, so a consumer could judiciously choose to error out instead of > > potentially misinterpreting the data. > > > > All in all, personally I'd rather we make a special case for Utf8View > > instead of adding a flag that can lead to worse interoperability. > > > > Regards > > > > Antoine.
Re: [DISCUSS][Format] C data interface for Utf8View
I'm afraid I've derailed the discussion into solving a bigger problem than strictly necessary. I don't think this is the time to solve the general problem of the C data interface having no way to communicate buffer sizes, particularly since there's no immediate agreement on its utility or implementation, but perhaps it is possible to solve it in a way that does not preclude implementing it in some generic way in the future. I think Ben's initial proposal of incrementing n_buffers by one and appending an int64_t* pointing to the buffer sizes accomplishes that, so consider me a +1. It might perhaps be more general if it included all buffer sizes (not just variadic ones), but given that it would only be useful for a few other types I don't think that is a game changer. It is probably also worth noting whether we expect the buffer containing the sizes to live on the CPU device always or whether we want it to live on the same device as the data buffers. On Thu, Oct 26, 2023 at 4:34 PM Antoine Pitrou wrote: > > > Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit : > >> Is this buffer lengths buffer only present if the array type is Utf8View? > > > > IIUC, the proposal would add the buffer lengths buffer for all types if the > > schema's > > flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid > > the special case and that `n_buffers` would continue to be consistent with > > IPC. > > This begs the question of what happens if a consumer receives an unknown > flag value. We haven't specified that unknown flag values should be > ignored, so a consumer could judiciously choose to error out instead of > potentially misinterpreting the data. > > All in all, personally I'd rather we make a special case for Utf8View > instead of adding a flag that can lead to worse interoperability. > > Regards > > Antoine.
Re: [DISCUSS][Format] C data interface for Utf8View
Le 26/10/2023 à 20:02, Benjamin Kietzman a écrit : Is this buffer lengths buffer only present if the array type is Utf8View? IIUC, the proposal would add the buffer lengths buffer for all types if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid the special case and that `n_buffers` would continue to be consistent with IPC. This begs the question of what happens if a consumer receives an unknown flag value. We haven't specified that unknown flag values should be ignored, so a consumer could judiciously choose to error out instead of potentially misinterpreting the data. All in all, personally I'd rather we make a special case for Utf8View instead of adding a flag that can lead to worse interoperability. Regards Antoine.
Re: [DISCUSS][Format] C data interface for Utf8View
> Is this buffer lengths buffer only present if the array type is Utf8View? IIUC, the proposal would add the buffer lengths buffer for all types if the schema's flags include ARROW_FLAG_BUFFER_LENGTHS. I do find it appealing to avoid the special case and that `n_buffers` would continue to be consistent with IPC. On Thu, Oct 26, 2023 at 1:35 PM Weston Pace wrote: > Is this buffer lengths buffer only present if the array type is Utf8View? > Or are you suggesting that other types might want to adopt this as well? > > On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington > wrote: > > > > I expect C code to not be much longer then this :-) > > > > nanoarrow's buffer-length-calculation and validation concepts are > > (perhaps inadvisably) intertwined...even with both it is not that much > > code (perhaps I was remembering how much time it took me to figure out > > which 35 lines to write :-)) > > > > > That sounds a bit hackish to me. > > > > Including only *some* buffer sizes in array->buffers[array->n_buffers] > > special-cased for only two types (or altering the number of buffers > > required by the IPC format vs. the number of buffers required by the C > > Data interface) seem equally hackish to me (not that I'm opposed to > > either necessarily...the alternatives really are very bad). > > > > > How can you *not* care about buffer sizes, if you for example need to > > send the buffers over IPC? > > > > I think IPC is the *only* operation that requires that information? > > (Other than perhaps copying to another device?) I don't think there's > > any barrier to accessing the content of all the array elements but I > > could be mistaken. > > > > On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou > wrote: > > > > > > > > > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit : > > > > The lack of buffer sizes is something that has come up for me a few > > > > times working with nanoarrow (which dedicates a significant amount of > > > > code to calculating buffer sizes, which it uses to do validation and > > > > more efficient copying). > > > > > > By the way, this is a bit surprising since it's really 35 lines of code > > > in C++ currently: > > > > > > > > > https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754 > > > > > > I expect C code to not be much longer then this :-) > > > > > > Regards > > > > > > Antoine. > > >
Re: [DISCUSS][Format] C data interface for Utf8View
Is this buffer lengths buffer only present if the array type is Utf8View? Or are you suggesting that other types might want to adopt this as well? On Thu, Oct 26, 2023 at 10:00 AM Dewey Dunnington wrote: > > I expect C code to not be much longer then this :-) > > nanoarrow's buffer-length-calculation and validation concepts are > (perhaps inadvisably) intertwined...even with both it is not that much > code (perhaps I was remembering how much time it took me to figure out > which 35 lines to write :-)) > > > That sounds a bit hackish to me. > > Including only *some* buffer sizes in array->buffers[array->n_buffers] > special-cased for only two types (or altering the number of buffers > required by the IPC format vs. the number of buffers required by the C > Data interface) seem equally hackish to me (not that I'm opposed to > either necessarily...the alternatives really are very bad). > > > How can you *not* care about buffer sizes, if you for example need to > send the buffers over IPC? > > I think IPC is the *only* operation that requires that information? > (Other than perhaps copying to another device?) I don't think there's > any barrier to accessing the content of all the array elements but I > could be mistaken. > > On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou wrote: > > > > > > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit : > > > The lack of buffer sizes is something that has come up for me a few > > > times working with nanoarrow (which dedicates a significant amount of > > > code to calculating buffer sizes, which it uses to do validation and > > > more efficient copying). > > > > By the way, this is a bit surprising since it's really 35 lines of code > > in C++ currently: > > > > > https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754 > > > > I expect C code to not be much longer then this :-) > > > > Regards > > > > Antoine. >
Re: [DISCUSS][Format] C data interface for Utf8View
Le 26/10/2023 à 18:59, Dewey Dunnington a écrit : That sounds a bit hackish to me. Including only *some* buffer sizes in array->buffers[array->n_buffers] special-cased for only two types (or altering the number of buffers required by the IPC format vs. the number of buffers required by the C Data interface) seem equally hackish to me (not that I'm opposed to either necessarily...the alternatives really are very bad). I think the plan for Utf8View is that `n_buffers` is incremented to reflect that additional buffer at the end. How can you *not* care about buffer sizes, if you for example need to send the buffers over IPC? I think IPC is the *only* operation that requires that information? (Other than perhaps copying to another device?) I don't think there's any barrier to accessing the content of all the array elements but I could be mistaken. That's true, but IPC is implemented by all major Arrow implementations, AFAIK :-) Regards Antoine.
Re: [DISCUSS][Format] C data interface for Utf8View
> I expect C code to not be much longer then this :-) nanoarrow's buffer-length-calculation and validation concepts are (perhaps inadvisably) intertwined...even with both it is not that much code (perhaps I was remembering how much time it took me to figure out which 35 lines to write :-)) > That sounds a bit hackish to me. Including only *some* buffer sizes in array->buffers[array->n_buffers] special-cased for only two types (or altering the number of buffers required by the IPC format vs. the number of buffers required by the C Data interface) seem equally hackish to me (not that I'm opposed to either necessarily...the alternatives really are very bad). > How can you *not* care about buffer sizes, if you for example need to send > the buffers over IPC? I think IPC is the *only* operation that requires that information? (Other than perhaps copying to another device?) I don't think there's any barrier to accessing the content of all the array elements but I could be mistaken. On Thu, Oct 26, 2023 at 1:04 PM Antoine Pitrou wrote: > > > Le 26/10/2023 à 17:45, Dewey Dunnington a écrit : > > The lack of buffer sizes is something that has come up for me a few > > times working with nanoarrow (which dedicates a significant amount of > > code to calculating buffer sizes, which it uses to do validation and > > more efficient copying). > > By the way, this is a bit surprising since it's really 35 lines of code > in C++ currently: > > https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754 > > I expect C code to not be much longer then this :-) > > Regards > > Antoine.
Re: [DISCUSS][Format] C data interface for Utf8View
Le 26/10/2023 à 17:45, Dewey Dunnington a écrit : The lack of buffer sizes is something that has come up for me a few times working with nanoarrow (which dedicates a significant amount of code to calculating buffer sizes, which it uses to do validation and more efficient copying). By the way, this is a bit surprising since it's really 35 lines of code in C++ currently: https://github.com/apache/arrow/blob/57f643c2cecca729109daae18c7a64f3a37e76e4/cpp/src/arrow/c/bridge.cc#L1721-L1754 I expect C code to not be much longer then this :-) Regards Antoine.
Re: [DISCUSS][Format] C data interface for Utf8View
Le 26/10/2023 à 17:45, Dewey Dunnington a écrit : > A potential alternative might be to allow any ArrowArray to declare > its buffer sizes in array->buffers[array->n_buffers], perhaps with a > new flag in schema->flags to advertise that capability. That sounds a bit hackish to me. I'd rather live with the current arrangement for now. Once enough griefs with the C Data Interface accumulate, it will be time to think about a new specification. We might want to keep the variadic buffers at the end and instead export the buffer sizes as buffer #2? Though that's mostly stylistic... I would prefer the buffer sizes to be after as it preserves the connection between Columnar/IPC format and the C Data interface...the need for buffer_sizes is more of a convenience for implementations that care about this kind of thing than something inherent to the array data. How can you *not* care about buffer sizes, if you for example need to send the buffers over IPC? Regards Antoine.
Re: [DISCUSS][Format] C data interface for Utf8View
Ben kindly explained to me offline that the need for the buffer sizes is because when Arrow C++ imports an Array it creates Buffer class wrappers around the imported pointers. Arrow C++ does not have a notion of a buffer of unknown size to my knowledge, which leaves two undesirable alternatives: (1) loop over every string to calculate the maximum referenced buffer size for each buffer or (2) overhaul the Buffer class to allow unknown buffer sizes and suffer the corresponding performance/support issues when doing something with the array data that would otherwise be type-agnostic (e.g., converting to IPC). The lack of buffer sizes is something that has come up for me a few times working with nanoarrow (which dedicates a significant amount of code to calculating buffer sizes, which it uses to do validation and more efficient copying). The most recent issue I have had was when implementing the Arrow C Device Interface: for string and binary (+ the large counterparts) it is necessary to access the buffers to calculate the sizes, which makes it difficult to write generic/performant code copying an entire array between devices. A potential alternative might be to allow any ArrowArray to declare its buffer sizes in array->buffers[array->n_buffers], perhaps with a new flag in schema->flags to advertise that capability. I'm happy to defer that discussion to another time but if there is no opposition, it might be cleaner to include sooner than later (because it does not involve special-casing specific types). > We might want to keep the variadic buffers at the end and instead export > the buffer sizes as buffer #2? Though that's mostly stylistic... I would prefer the buffer sizes to be after as it preserves the connection between Columnar/IPC format and the C Data interface...the need for buffer_sizes is more of a convenience for implementations that care about this kind of thing than something inherent to the array data. Cheers! -dewey On Wed, Oct 25, 2023 at 1:47 PM Antoine Pitrou wrote: > > > Hello, > > We might want to keep the variadic buffers at the end and instead export > the buffer sizes as buffer #2? Though that's mostly stylistic... > > Regards > > Antoine. > > > Le 25/10/2023 à 18:36, Benjamin Kietzman a écrit : > > Hello all, > > > > The C ABI does not store buffer lengths explicitly, which presents a > > problem for Utf8View since buffer lengths are not trivially extractable > > from other data in the array. A potential solution is to store the lengths > > in an extra buffer after the variadic data buffers. I've adopted this > > approach in my (currently draft) PR [1] to add c++ library import/export > > for Utf8VIew, but I thought this warranted raising on the ML in case anyone > > has a better idea. > > > > Sincerely, > > Ben Kietzman > > > > [1] > > https://github.com/bkietz/arrow/compare/37710-cxx-impl-string-view..36099-string-view-c-abi#diff-3907fc8e8c9fa4ed7268f6baa5b919e8677fb99947b7384a9f8f001174ab66eaR549 > >
Re: [ANNOUNCE] New Arrow committer: Xuwei Fu
Congratulations! On Thu, Oct 26, 2023 at 10:35 AM Dane Pitkin wrote: > Congratulations, Xuwei! > > On Thu, Oct 26, 2023 at 9:34 AM Joris Van den Bossche < > jorisvandenboss...@gmail.com> wrote: > > > Congrats! > > > > On Wed, 25 Oct 2023 at 08:23, Ian Joiner wrote: > > > > > > Congrats! > > > > > > On Mon, Oct 23, 2023 at 2:33 AM Sutou Kouhei > wrote: > > > > > > > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu > > > > has accepted an invitation to become a committer on Apache > > > > Arrow. Welcome, and thank you for your contributions! > > > > > > > > -- > > > > kou > > > > > > >
Re: [ANNOUNCE] New Arrow committer: Xuwei Fu
Congratulations, Xuwei! On Thu, Oct 26, 2023 at 9:34 AM Joris Van den Bossche < jorisvandenboss...@gmail.com> wrote: > Congrats! > > On Wed, 25 Oct 2023 at 08:23, Ian Joiner wrote: > > > > Congrats! > > > > On Mon, Oct 23, 2023 at 2:33 AM Sutou Kouhei wrote: > > > > > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu > > > has accepted an invitation to become a committer on Apache > > > Arrow. Welcome, and thank you for your contributions! > > > > > > -- > > > kou > > > >
Re: [ANNOUNCE] New Arrow committer: Xuwei Fu
Congrats! On Wed, 25 Oct 2023 at 08:23, Ian Joiner wrote: > > Congrats! > > On Mon, Oct 23, 2023 at 2:33 AM Sutou Kouhei wrote: > > > On behalf of the Arrow PMC, I'm happy to announce that Xuwei Fu > > has accepted an invitation to become a committer on Apache > > Arrow. Welcome, and thank you for your contributions! > > > > -- > > kou > >