Hello all,

As previously discussed on this list [1], an UmbraDB/DuckDB/Velox compatible "string view" type could bring several performance benefits to access and authoring of string data in the Arrow format [2]. Additionally, better interoperability could be established with engines already using this format.

PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation and to the IPC format. For the purposes of IPC, raw pointers are not used. Instead, each view contains a pair of 32-bit unsigned integers which encode, respectively, the index of a character buffer (string view arrays may consist of a variable number of such buffers) and the offset of the view's data within that buffer. Benefits of this substitution include:

- This makes explicit the guarantee that the lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary.
- As with other types in the Arrow format, such arrays are serializable and venue agnostic; they are directly usable in shared memory without modification.
- Indices and offsets are easily validated.

Accessing the data requires some trivial pointer arithmetic, but in benchmarking this had negligible impact on sequential access and only minor impact on random access.

In the C++ implementation, raw pointer string views are supported as an extended case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`. Branching on this access pattern bit at the data type level has negligible impact on performance since the branch resides outside any hot loops. Utility functions are provided for efficient (potentially in-place) conversion between raw pointer and index/offset views. For example, the C++ implementation could zero-copy a raw pointer array from Velox, filter it, then convert to index/offset for serialization. Other implementations may choose to accommodate or eschew raw pointer views as their communities direct.

Where desired in a rigorously controlled context, this still enables construction and safe consumption of string view arrays which reference memory not directly bound to the lifetime of the array. I'm not sure when or if we would find it useful to have arrays like this; I do not introduce any in [3].
I mention this possibility to highlight that if benchmarking demonstrates that such an approach brings a significant performance benefit to some operation, the only barrier to its adoption would be code review. Likewise, if more intensive benchmarking determines that raw pointer views critically outperform index/offset views for real-world analytics tasks, prioritizing raw pointer string views for usage within the C++ implementation will be straightforward.

See also the proposal to Velox that their string view vector be refactored in a similar vein [4].

Sincerely,
Ben Kietzman

[1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
[2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
[3] https://github.com/apache/arrow/pull/35628
[4] https://github.com/facebookincubator/velox/discussions/4362
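
As an illustration of the index/offset encoding described above, here is a minimal C++17 sketch of how a reader might validate and resolve a single view. It is not code from [3]. The names `IndexOffsetView` and `Resolve`, the `size` field, and the use of std::string as a stand-in for character buffers are assumptions made for this example; only the pair of 32-bit unsigned integers (buffer index and offset) is taken from the description.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical view layout. Only `buffer_index` and `offset` come from the
// description above; `size` is an assumed field so the sketch can bounds-check.
struct IndexOffsetView {
  uint32_t size;          // length of the viewed data in bytes (assumed)
  uint32_t buffer_index;  // which character buffer holds the data
  uint32_t offset;        // where the data starts within that buffer
};

// Validate a view against the array's character buffers and, if it is in
// bounds, resolve it to a std::string_view with simple pointer arithmetic.
// std::string stands in for a character buffer here (an assumption).
std::optional<std::string_view> Resolve(
    const IndexOffsetView& view, const std::vector<std::string>& buffers) {
  // The buffer index must name one of the array's character buffers.
  if (view.buffer_index >= buffers.size()) return std::nullopt;

  const std::string& buffer = buffers[view.buffer_index];

  // The range [offset, offset + size) must lie inside the buffer. The
  // subtraction form avoids overflow in offset + size.
  if (view.offset > buffer.size()) return std::nullopt;
  if (view.size > buffer.size() - view.offset) return std::nullopt;

  return std::string_view(buffer.data() + view.offset, view.size);
}
```

Because validation only compares integers against buffer counts and lengths, a consumer can check every view in an array without dereferencing anything it has not first verified, which is the property the "easily validated" bullet refers to.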