[DISCUSS][Format] Draft implementation of string view array format

Benjamin Kietzman Tue, 16 May 2023 14:39:25 -0700

Hello all,

As previously discussed on this list [1], an UmbraDB/DuckDB/Velox compatible
"string view" type could bring several performance benefits to access and
authoring of string data in the Arrow format [2]. It would also improve
interoperability with engines that already use this format.


PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation and to
the IPC format. For the purposes of IPC, raw pointers are not used. Instead,
each view contains a pair of 32-bit unsigned integers which encode the index of
a character buffer (string view arrays may consist of a variable number of
such buffers) and the offset of the view's data within that buffer,
respectively.
Benefits of this substitution include:
- This makes explicit the guarantee that the lifetime of all character data is
  equal to that of the array which views it, which is critical for confident
  consumption across an interface boundary.
- As with other types in the arrow format, such arrays are serializable and
  venue agnostic; directly usable in shared memory without modification.
- Indices and offsets are easily validated.
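
To illustrate the last point, here is a minimal sketch of what validating a
single view could look like. The struct and function names are hypothetical
and are not taken from [3]; the struct carries only the two 32-bit fields
described above plus a length, and is not the actual memory layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified view: the buffer index and offset described above,
// plus the length of the string in bytes. Not the layout used in the PR.
struct IndexOffsetView {
  uint32_t length;        // size of the string in bytes
  uint32_t buffer_index;  // which character buffer holds the data
  uint32_t offset;        // where the data starts within that buffer
};

// Returns true if the view refers to bytes lying entirely inside one of the
// array's character buffers. buffer_sizes[i] is the size in bytes of buffer i.
inline bool IsValidView(const IndexOffsetView& v,
                        const std::vector<size_t>& buffer_sizes) {
  if (v.buffer_index >= buffer_sizes.size()) return false;
  // Sum in 64 bits so that offset + length cannot wrap around.
  const uint64_t end = static_cast<uint64_t>(v.offset) + v.length;
  return end <= buffer_sizes[v.buffer_index];
}
```

No pointer needs to be dereferenced to perform this check, which is what
makes it safe to run on data received from an untrusted producer.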

Accessing the data requires some trivial pointer arithmetic, but in
benchmarking this had negligible impact on sequential access and only minor
impact on random access.
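
As a rough sketch of that pointer arithmetic (the names below are
hypothetical, and the struct is simplified: Umbra-style views also store a
short prefix and inline short strings, which is omitted here), resolving an
index/offset view is one buffer lookup and one addition:

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Hypothetical simplified view; not the layout used in the PR.
struct IndexOffsetView {
  uint32_t length;        // size of the string in bytes
  uint32_t buffer_index;  // which character buffer holds the data
  uint32_t offset;        // where the data starts within that buffer
};

// Resolves a view against the array's character buffers.
// buffers[i] points at the first byte of character buffer i.
inline std::string_view Resolve(const IndexOffsetView& v,
                                const std::vector<const char*>& buffers) {
  assert(v.buffer_index < buffers.size());
  const char* data = buffers[v.buffer_index] + v.offset;  // the only extra work
  return std::string_view(data, v.length);
}
```

A raw pointer view would skip the lookup and the addition, which is the
difference the benchmarks above measure.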

In the C++ implementation, raw pointer string views are supported as an
extended case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`.
Branching on this access pattern bit at the data type level has negligible
impact on performance since the branch resides outside any hot loops. Utility
functions are provided for efficient (potentially in-place) conversion between
raw pointer and index/offset views. For example, the C++ implementation could
zero-copy a raw pointer array from Velox, filter it, then convert to
index/offset for serialization. Other implementations may choose to
accommodate or eschew raw pointer views as their communities direct.
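
The conversion between the two representations can be sketched as follows.
This is only an illustration under stated assumptions: the types and function
names are hypothetical and are not the utility functions from [3], the views
are simplified to a length plus a location, and each character buffer is
assumed to be smaller than 4 GiB so that offsets fit in 32 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical simplified types; not the layouts or names used in the PR.
struct RawPointerView { uint32_t length; const char* data; };
struct IndexOffsetView { uint32_t length; uint32_t buffer_index; uint32_t offset; };
struct CharBuffer { const char* data; size_t size; };

// Raw pointer -> index/offset: find the buffer that fully contains the
// viewed bytes. Returns nullopt if the pointer refers to memory that is not
// owned by any of the array's buffers.
inline std::optional<IndexOffsetView> ToIndexOffset(
    const RawPointerView& v, const std::vector<CharBuffer>& buffers) {
  const auto p = reinterpret_cast<uintptr_t>(v.data);
  for (size_t i = 0; i < buffers.size(); ++i) {
    const auto begin = reinterpret_cast<uintptr_t>(buffers[i].data);
    if (p >= begin && p - begin + v.length <= buffers[i].size) {
      return IndexOffsetView{v.length, static_cast<uint32_t>(i),
                             static_cast<uint32_t>(p - begin)};
    }
  }
  return std::nullopt;
}

// Index/offset -> raw pointer: a lookup and an addition.
inline RawPointerView ToRawPointer(const IndexOffsetView& v,
                                   const std::vector<CharBuffer>& buffers) {
  assert(v.buffer_index < buffers.size());
  return RawPointerView{v.length, buffers[v.buffer_index].data + v.offset};
}
```

In this sketch the character data itself is never copied; only the
fixed-size view structs are rewritten.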

Where desired in a rigorously controlled context, this still enables
construction and safe consumption of string view arrays which reference memory
not directly bound to the lifetime of the array. I'm not sure when or if we
would find it useful to have arrays like this; I do not introduce any in [3].
I mention this possibility to highlight that if benchmarking demonstrates that
such an approach brings a significant performance benefit to some operation,
the only barrier to its adoption would be code review. Likewise, if more
intensive benchmarking determines that raw pointer views critically outperform
index/offset views for real-world analytics tasks, prioritizing raw pointer
string views for usage within the C++ implementation will be straightforward.

See also the proposal to Velox that their string view vector be refactored
in a similar vein [4].

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
[3] https://github.com/apache/arrow/pull/35628
[4] https://github.com/facebookincubator/velox/discussions/4362
