Re: [DISCUSS][C++] Raw pointer string views

Raphael Taylor-Davies Wed, 27 Sep 2023 15:13:06 -0700

Do you have any benchmarks comparing kernels with native pointer array support 
against those that must first convert to the offset representation? I think 
this would help ground this discussion empirically.


On 27 September 2023 22:25:02 BST, Benjamin Kietzman <bengil...@gmail.com> 
wrote:
>Hello all,
>
>@Gang
>> Could you please simply describe the layout of DuckDB and Velox
>
>Arrow represents long (>12 bytes) strings with a view which includes
>a buffer index (used to look up one of the variadic data buffers)
>and an offset (used to find the start of a string's bytes within the
>indicated buffer). DuckDB and Velox by contrast have a raw pointer
>directly to the start of the string's bytes. Since these occupy the
>same 8 bytes of a view, it's possible and fairly efficient to convert
>from one representation to the other by modifying those 8 bytes in place.
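
The in-place conversion described above could look roughly like the following. This is only a sketch under stated assumptions: the `View` struct, its field names, and the helper functions are hypothetical and are not Arrow C++'s actual API; it also assumes 64-bit pointers and that each variadic data buffer is available as a `std::string_view`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical 16-byte view; field names are illustrative, not Arrow's API.
struct View {
  std::int32_t size;     // length of the string in bytes
  char prefix[4];        // first 4 bytes of a long string
  unsigned char ref[8];  // {int32 buffer_index, int32 offset} OR a raw pointer
};
static_assert(sizeof(View) == 16, "views are 16 bytes");
static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

constexpr std::int32_t kInlineSize = 12;  // short strings live inside the view

// Rewrite the trailing 8 bytes from {buffer_index, offset} to a raw pointer.
void ToRawPointer(View& v, const std::vector<std::string_view>& buffers) {
  if (v.size <= kInlineSize) return;  // inline views hold no reference
  std::int32_t index, offset;
  std::memcpy(&index, v.ref, 4);
  std::memcpy(&offset, v.ref + 4, 4);
  assert(index >= 0 && static_cast<std::size_t>(index) < buffers.size());
  const char* p = buffers[index].data() + offset;
  std::memcpy(v.ref, &p, sizeof(p));
}

// Rewrite the trailing 8 bytes from a raw pointer to {buffer_index, offset}.
// Returns false if the string's bytes lie in none of the given buffers.
bool ToIndexOffset(View& v, const std::vector<std::string_view>& buffers) {
  if (v.size <= kInlineSize) return true;
  const char* p;
  std::memcpy(&p, v.ref, sizeof(p));
  const auto addr = reinterpret_cast<std::uintptr_t>(p);
  const auto len = static_cast<std::uintptr_t>(v.size);
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const auto begin = reinterpret_cast<std::uintptr_t>(buffers[i].data());
    const auto end = begin + buffers[i].size();
    if (addr >= begin && addr + len <= end) {
      const auto index = static_cast<std::int32_t>(i);
      const auto offset = static_cast<std::int32_t>(addr - begin);
      std::memcpy(v.ref, &index, 4);
      std::memcpy(v.ref + 4, &offset, 4);
      return true;
    }
  }
  return false;
}

// Read a string from a raw pointer view: no buffer lookup is needed.
std::string_view ReadPointerView(const View& v) {
  const auto n = static_cast<std::size_t>(v.size);
  if (v.size <= kInlineSize) {
    // Inline bytes start right after the 4-byte size field.
    return {reinterpret_cast<const char*>(&v) + 4, n};
  }
  const char* p;
  std::memcpy(&p, v.ref, sizeof(p));
  return {p, n};
}
```

In this sketch the pointer-to-offset direction has to search the buffers to find the one containing the string, whereas the offset-to-pointer direction is a single lookup and add.
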
>
>@Raphael
>> Is the motivation here to avoid DuckDB and Velox having to duplicate the
>> conversion logic from pointer-based to offset-based, or to allow
>> arrow-cpp to operate directly on pointer-based arrays?
>
>It's more the latter: Arrow C++ is intended to be useful as more than an IPC
>serializer/deserializer, so it is beneficial to be able to import arrays
>and operate on them with no conversion cost. It's also worth
>noting that the raw pointer representation is more efficient on access,
>albeit more expensive to validate, along with a number of other tradeoffs.
>To make progress on this work, I took this hybrid approach in part to defer
>the question of which representation is preferred in which context. I would
>like to allow the C++ library freedom to extract as much performance from
>this type as possible, internally as well as when communicating with other
>engines.
>
>@Antoine
>> What this PR is creating is an "unofficial" Arrow format, with data
>> types exposed in Arrow C++ that are not part of the Arrow standard, but
>> are exposed as if they were.
>
>We already do this in every implementation of the Arrow format I'm
>aware of: it's more convenient to consider dictionary as a data type
>even though the spec says that it is a field property. I don't think
>it's illegal or unreasonable for an implementation to diverge in its
>internal handling of Arrow data (whether to achieve performance,
>consistency, or convenience).
>
>> I'm not sure how DuckDB and Velox data could be exposed, but it could be
>> for example an extension type with a fixed_size_binary<16> storage type.
>
>This wouldn't allow for the transmission of the variadic data buffers
>which (even in the presence of raw pointer views) are necessary to
>guarantee the lifetime of string data in the vector. Alternatively, we
>could use Utf8View with the high and low bits of the raw pointer
>packed into the index and offset, but I don't think this would be any
>less of an unofficial Arrow format.
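
The pointer-packing alternative mentioned above could be sketched as follows. The names are hypothetical, and the choice of placing the high 32 bits in the index field and the low 32 bits in the offset field is an assumption made for illustration; the message does not specify it. The sketch also assumes 64-bit pointers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the index/offset half of a Utf8View view.
struct IndexOffset {
  std::int32_t buffer_index;
  std::int32_t offset;
};
static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// Split a pointer into two 32-bit halves and store them in the two fields.
// Assumption: high bits go to buffer_index, low bits go to offset.
IndexOffset PackPointer(const char* p) {
  const auto bits =
      static_cast<std::uint64_t>(reinterpret_cast<std::uintptr_t>(p));
  const auto hi = static_cast<std::uint32_t>(bits >> 32);
  const auto lo = static_cast<std::uint32_t>(bits & 0xFFFFFFFFu);
  IndexOffset out;
  std::memcpy(&out.buffer_index, &hi, 4);
  std::memcpy(&out.offset, &lo, 4);
  return out;
}

// Reassemble the pointer from the two fields.
const char* UnpackPointer(IndexOffset packed) {
  std::uint32_t hi, lo;
  std::memcpy(&hi, &packed.buffer_index, 4);
  std::memcpy(&lo, &packed.offset, 4);
  const std::uint64_t bits = (static_cast<std::uint64_t>(hi) << 32) | lo;
  return reinterpret_cast<const char*>(static_cast<std::uintptr_t>(bits));
}

// Small demonstration helper: packing then unpacking yields the same pointer.
void CheckRoundTrip(const char* p) {
  assert(UnpackPointer(PackPointer(p)) == p);
}
```

Fields packed this way would generally not name a real buffer or a real offset into one, so a reader treating them as ordinary index/offset views would misinterpret them.
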
>
>Sincerely,
>Ben Kietzman
>
>
>On Wed, Sep 27, 2023 at 2:51 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> Hello,
>>
>> What this PR is creating is an "unofficial" Arrow format, with data
>> types exposed in Arrow C++ that are not part of the Arrow standard, but
>> are exposed as if they were. Most users will probably not read the
>> official format spec, but will simply trust the official Arrow
>> implementations. So the official Arrow implementations have an
>> obligation to faithfully represent the Arrow format and not breed
>> confusion.
>>
>> So I'm -1 on the way the PR presents things currently.
>>
>> I'm not sure how DuckDB and Velox data could be exposed, but it could be
>> for example an extension type with a fixed_size_binary<16> storage type.
>>
>> Regards
>>
>> Antoine.
>>
>>
>>
>> On 26/09/2023 at 22:34, Benjamin Kietzman wrote:
>> > Hello all,
>> >
>> > In the PR to add support for Utf8View to the C++ implementation,
>> > I've taken the approach of allowing raw pointer views [1] alongside the
>> > index/offset views described in the spec [2]. This was done to ease
>> > communication with other engines such as DuckDB and Velox whose native
>> > string representation is the raw pointer view. For the library to be
>> > usable as a utility for writing IPC files and for other operations on
>> > Arrow-formatted data, it is useful for it to be able to directly
>> > import raw pointer arrays, even when immediately converting these to
>> > the index/offset representation.
>> >
>> > However, there has been an objection in review [3], since the raw pointer
>> > representation is not part of the official format. Since data visitation
>> > utilities are generic, IMHO this hybrid approach does not add
>> > significantly to the complexity of the C++ library, and I feel the
>> > aforementioned interoperability is a high priority when adding this
>> > feature to the C++ library. It's worth noting that this interoperability
>> > has been a stated goal of the Utf8View type since its original proposal [4]
>> > and throughout the discussion of its adoption [5].
>> >
>> > Sincerely,
>> > Ben Kietzman
>> >
>> > [1]: https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
>> > [2]: https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
>> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
>> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
>> >
>>
