Hello all,

In the PR to add support for Utf8View to the C++ implementation,
I've taken the approach of allowing raw pointer views [1] alongside the
index/offset views described in the spec [2]. This was done to ease
communication with other engines such as DuckDB and Velox whose native
string representation is the raw pointer view. For the library to be
usable as a utility for writing IPC files and performing other
operations on Arrow-formatted data, it is useful for it to import raw
pointer arrays directly, even when these are immediately converted to
the index/offset representation.
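
To make the difference between the two layouts concrete, here is a rough
sketch of the out-of-line (long string) case and of the conversion from
a raw pointer view to an index/offset view. The struct names, field
names, and the ToIndexOffset helper are hypothetical and are not the
API in the PR; the sketch also assumes 64-bit pointers and that the
caller knows the set of data buffers the pointers refer into.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>
#include <vector>

// Strings of at most 12 bytes are stored inline in the 16-byte view in
// both layouts, so only longer strings need converting. The structs
// below model just that out-of-line case. All names are illustrative.

// Raw pointer view: length, 4-byte prefix, pointer to the characters.
struct RawPointerRef {
  int32_t size;
  char prefix[4];
  const char* data;
};

// Index/offset view: length, 4-byte prefix, index of a data buffer,
// and the offset of the string within that buffer.
struct IndexOffsetRef {
  int32_t size;
  char prefix[4];
  int32_t buffer_index;
  int32_t offset;
};

// A data buffer described by its base address and length in bytes.
struct DataBuffer {
  const char* base;
  int64_t length;
};

// Find the buffer that fully contains the string `raw` points at and
// rewrite the view as (buffer index, offset). Returns false if no
// buffer contains it; a real importer would then have to copy the
// bytes into a buffer it owns.
bool ToIndexOffset(const RawPointerRef& raw,
                   const std::vector<DataBuffer>& buffers,
                   IndexOffsetRef* out) {
  // std::less_equal gives a total order over unrelated pointers.
  std::less_equal<const char*> le;
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const char* begin = buffers[i].base;
    const char* end = begin + buffers[i].length;
    if (le(begin, raw.data) && le(raw.data + raw.size, end)) {
      out->size = raw.size;
      std::memcpy(out->prefix, raw.prefix, sizeof(out->prefix));
      out->buffer_index = static_cast<int32_t>(i);
      out->offset = static_cast<int32_t>(raw.data - begin);
      return true;
    }
  }
  return false;
}
```

The linear scan over buffers is only for illustration; with many
buffers an importer would presumably sort them by base address first.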

However, there has been objection in review [3] since the raw pointer
representation is not part of the official format. Since data visitation
utilities are generic, IMHO this hybrid approach does not add
significantly to the complexity of the C++ library, and I feel the
aforementioned interoperability is a high priority when adding this
feature to the C++ library. It's worth noting that this interoperability
has been a stated goal of the Utf8View type since its original proposal [4]
and throughout the discussion of its adoption [5].

Sincerely,
Ben Kietzman

[1]:
https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
[2]:
https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
[3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
[4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
