Do you have any benchmarks comparing kernels with native pointer array support against those that must first convert to the offset representation? I think this would help ground this discussion empirically.
On 27 September 2023 22:25:02 BST, Benjamin Kietzman <bengil...@gmail.com> wrote:
>Hello all,
>
>@Gang
>> Could you please simply describe the layout of DuckDB and Velox
>
>Arrow represents long (>12 bytes) strings with a view which includes
>a buffer index (used to look up one of the variadic data buffers)
>and an offset (used to find the start of a string's bytes within the
>indicated buffer). DuckDB and Velox by contrast have a raw pointer
>directly to the start of the string's bytes. Since these occupy the
>same 8 bytes of a view, it's possible and fairly efficient to convert
>from one representation to the other by modifying those 8 bytes in place.
>
>@Raphael
>> Is the motivation here to avoid DuckDB and Velox having to duplicate the
>> conversion logic from pointer-based to offset-based, or to allow
>> arrow-cpp to operate directly on pointer-based arrays?
>
>It's more the latter; arrow C++ is intended to be useful as more than an IPC
>serializer/deserializer, so it is beneficial to be able to import arrays
>and also operate on them with no conversion cost. However it's also worth
>noting that the raw pointer representation is more efficient on access,
>albeit more expensive to validate along with a number of other tradeoffs.
>In order to progress this work, I took this hybrid approach in part to defer
>the question of which representation is preferred in which context. I would
>like to allow the C++ library freedom to extract as much performance from
>this type as possible, internally as well as when communicating with other
>engines.
>
>@Antoine
>> What this PR is creating is an "unofficial" Arrow format, with data
>> types exposed in Arrow C++ that are not part of the Arrow standard, but
>> are exposed as if they were.
>
>We already do this in every implementation of the arrow format I'm
>aware of: it's more convenient to consider dictionary as a data type
>even though the spec says that it is a field property.
>I don't think it's illegal or unreasonable for an implementation to diverge
>in its internal handling of arrow data (whether to achieve performance,
>consistency, or convenience).
>
>> I'm not sure how DuckDB and Velox data could be exposed, but it could be
>> for example an extension type with a fixed_size_binary<16> storage type.
>
>This wouldn't allow for the transmission of the variadic data buffers
>which (even in the presence of raw pointer views) are necessary to
>guarantee the lifetime of string data in the vector. Alternatively we
>could use Utf8View with the high and low bits of the raw pointer
>packed into the index and offset, but I don't think this would be less
>tantamount to an unofficial arrow format.
>
>Sincerely,
>Ben Kietzman
>
>
>On Wed, Sep 27, 2023 at 2:51 AM Antoine Pitrou <anto...@python.org> wrote:
>
>>
>> Hello,
>>
>> What this PR is creating is an "unofficial" Arrow format, with data
>> types exposed in Arrow C++ that are not part of the Arrow standard, but
>> are exposed as if they were. Most users will probably not read the
>> official format spec, but will simply trust the official Arrow
>> implementations. So the official Arrow implementations have an
>> obligation to faithfully represent the Arrow format and not breed
>> confusion.
>>
>> So I'm -1 on the way the PR presents things currently.
>>
>> I'm not sure how DuckDB and Velox data could be exposed, but it could be
>> for example an extension type with a fixed_size_binary<16> storage type.
>>
>> Regards
>>
>> Antoine.
>>
>>
>>
>> On 26/09/2023 at 22:34, Benjamin Kietzman wrote:
>> > Hello all,
>> >
>> > In the PR to add support for Utf8View to the C++ implementation,
>> > I've taken the approach of allowing raw pointer views [1] alongside the
>> > index/offset views described in the spec [2]. This was done to ease
>> > communication with other engines such as DuckDB and Velox whose native
>> > string representation is the raw pointer view.
>> > In order to be usable
>> > as a utility for writing IPC files and other operations on arrow
>> > formatted data, it is useful for the library to be able to directly
>> > import raw pointer arrays even when immediately converting these to
>> > the index/offset representation.
>> >
>> > However there has been objection in review [3] since the raw pointer
>> > representation is not part of the official format. Since data visitation
>> > utilities are generic, IMHO this hybrid approach does not add
>> > significantly to the complexity of the C++ library, and I feel the
>> > aforementioned interoperability is a high priority when adding this
>> > feature to the C++ library. It's worth noting that this interoperability
>> > has been a stated goal of the Utf8Type since its original proposal [4]
>> > and throughout the discussion of its adoption [5].
>> >
>> > Sincerely,
>> > Ben Kietzman
>> >
>> > [1]: https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752
>> > [2]: https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379
>> > [3]: https://github.com/apache/arrow/pull/37792#discussion_r1336010665
>> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
>> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4
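
For readers following the thread, here is a rough C++ sketch of the two long-string view layouts Ben describes: a 16-byte view whose last 8 bytes hold either a buffer index plus offset or a raw pointer, converted by rewriting those 8 bytes in place. This is not code from the PR or from Arrow C++. The names `LongView`, `IndexOffsetToPointer`, `PointerToIndexOffset` and the helpers are hypothetical, the data buffers are modelled as plain `std::string`s, and the sketch assumes 64-bit pointers. The linear search used to go from pointer back to index/offset is also an assumption made for illustration; a real implementation may locate the owning buffer differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical 16-byte view for a string longer than 12 bytes.
// `ref` holds EITHER {int32 buffer_index, int32 offset} OR a raw const char*.
struct LongView {
  std::int32_t length;
  char prefix[4];        // first four bytes of the string
  unsigned char ref[8];  // the 8 bytes shared by both representations
};
static_assert(sizeof(LongView) == 16, "view must be 16 bytes");
static_assert(sizeof(const char*) == 8, "sketch assumes 64-bit pointers");

// Build an index/offset view into buffers[buffer_index].
LongView MakeIndexOffsetView(const std::vector<std::string>& buffers,
                             std::int32_t buffer_index, std::int32_t offset,
                             std::int32_t length) {
  assert(length > 12);  // shorter strings are stored inline, not shown here
  LongView v{};
  v.length = length;
  std::memcpy(v.prefix,
              buffers[static_cast<std::size_t>(buffer_index)].data() + offset, 4);
  std::memcpy(v.ref, &buffer_index, 4);
  std::memcpy(v.ref + 4, &offset, 4);
  return v;
}

// Index/offset -> raw pointer: one buffer lookup, then overwrite the 8 bytes.
void IndexOffsetToPointer(LongView& v, const std::vector<std::string>& buffers) {
  std::int32_t index = 0, offset = 0;
  std::memcpy(&index, v.ref, 4);
  std::memcpy(&offset, v.ref + 4, 4);
  const char* ptr = buffers[static_cast<std::size_t>(index)].data() + offset;
  std::memcpy(v.ref, &ptr, sizeof(ptr));
}

// Raw pointer -> index/offset: find the buffer containing the pointed-to
// bytes (linear search, an assumption of this sketch). Returns false if the
// pointer does not fall inside any known buffer.
bool PointerToIndexOffset(LongView& v, const std::vector<std::string>& buffers) {
  const char* ptr = nullptr;
  std::memcpy(&ptr, v.ref, sizeof(ptr));
  const auto p = reinterpret_cast<std::uintptr_t>(ptr);
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const auto begin = reinterpret_cast<std::uintptr_t>(buffers[i].data());
    const auto end = begin + buffers[i].size();
    if (p >= begin && p + static_cast<std::size_t>(v.length) <= end) {
      const auto index = static_cast<std::int32_t>(i);
      const auto offset = static_cast<std::int32_t>(p - begin);
      std::memcpy(v.ref, &index, 4);
      std::memcpy(v.ref + 4, &offset, 4);
      return true;
    }
  }
  return false;
}

// Read the string through each representation.
std::string ReadIndexOffset(const LongView& v,
                            const std::vector<std::string>& buffers) {
  std::int32_t index = 0, offset = 0;
  std::memcpy(&index, v.ref, 4);
  std::memcpy(&offset, v.ref + 4, 4);
  return buffers[static_cast<std::size_t>(index)].substr(
      static_cast<std::size_t>(offset), static_cast<std::size_t>(v.length));
}

std::string ReadPointer(const LongView& v) {
  const char* ptr = nullptr;
  std::memcpy(&ptr, v.ref, sizeof(ptr));
  return std::string(ptr, static_cast<std::size_t>(v.length));
}

// Usage: convert a view to the pointer form and back, checking each read.
bool RoundTripExample() {
  const std::string expected = "a string longer than twelve bytes";
  const std::vector<std::string> buffers = {"an unrelated data buffer",
                                            "xxxx" + expected};
  LongView v = MakeIndexOffsetView(buffers, 1, 4,
                                   static_cast<std::int32_t>(expected.size()));
  if (ReadIndexOffset(v, buffers) != expected) return false;
  IndexOffsetToPointer(v, buffers);
  if (ReadPointer(v) != expected) return false;
  if (!PointerToIndexOffset(v, buffers)) return false;
  return ReadIndexOffset(v, buffers) == expected;
}
```

Reading through the pointer form needs no buffer lookup, which matches the point in the thread that it is cheaper on access, while checking that a pointer is valid requires knowing which buffer it should fall within. The buffers still have to be kept alive in either form, as noted in the reply to Antoine.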