bkietz opened a new pull request, #35628:
URL: https://github.com/apache/arrow/pull/35628
String view (and the equivalent non-utf8 binary view) is an alternative representation for variable-length strings which offers greater efficiency for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those databases use a raw pointer to out-of-line strings, this PR uses a pair of 32-bit integers as a buffer index and offset, which
- makes explicit the guarantee that the lifetime of all character data is equal to that of the array which views it, which is critical for confident consumption across an interface boundary
- makes the arrays meaningfully serializable and venue-agnostic: they are directly usable in shared memory without modification
- allows easy validation
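
To illustrate why an index/offset pair is easy to validate, here is a minimal, self-contained sketch. It does not use the Arrow API: the names `IndexOffsetView`, `IsValid`, and `Resolve` are hypothetical, character buffers are modeled as a `std::vector<std::string>`, and the sketch ignores any inline storage of short strings that the real layout may use.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical view element (illustrative only, not the PR's actual struct).
struct IndexOffsetView {
  int32_t size;          // length of the string in bytes
  int32_t buffer_index;  // which character buffer holds the bytes
  int32_t offset;        // byte offset of the first character in that buffer
};

// A view is valid if it refers to bytes lying entirely inside one known buffer.
// No pointer needs to be dereferenced to decide this.
bool IsValid(const IndexOffsetView& v, const std::vector<std::string>& buffers) {
  if (v.size < 0 || v.buffer_index < 0 || v.offset < 0) return false;
  if (static_cast<std::size_t>(v.buffer_index) >= buffers.size()) return false;
  const std::string& buf = buffers[static_cast<std::size_t>(v.buffer_index)];
  // 64-bit arithmetic so that offset + size cannot overflow.
  return static_cast<int64_t>(v.offset) + v.size <=
         static_cast<int64_t>(buf.size());
}

// Resolve a (validated) view to the characters it refers to.
std::string_view Resolve(const IndexOffsetView& v,
                         const std::vector<std::string>& buffers) {
  assert(IsValid(v, buffers));
  const std::string& buf = buffers[static_cast<std::size_t>(v.buffer_index)];
  return std::string_view(buf).substr(static_cast<std::size_t>(v.offset),
                                      static_cast<std::size_t>(v.size));
}
```

With buffers `{"hello world", "arrow"}`, the view `{5, 0, 6}` resolves to `world`, while a view naming buffer index 2 is rejected by a bounds check alone. A raw pointer view offers no comparable check unless the consumer also knows which allocations the pointer may legally target.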
Changes outside the C++ implementation:
- New types added to `Schema.fbs`
- `Message.fbs` amended to support variable buffer counts between string view chunks
- `datagen.py` extended to produce integration JSON for string view arrays
- `Columnar.rst` amended with a description of the string view format
Changes to the C++ implementation:
- The new types are available with new subclasses of DataType, Array, ArrayBuilder, ...
- The values of string view arrays can be visited as `std::string_view`, as with StringArray
- String view arrays can be round-tripped through IPC, Parquet, and integration JSON
- A variant of the string view type, `utf8_view(/*has_raw_pointers=*/true)`, is supported which uses raw pointer views. This enables zero-copy interop with code which uses raw pointer views.
- Conversions are provided between index/offset view arrays, raw pointer view arrays, and regular string arrays.
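
As a rough illustration of what converting between the two view flavors involves, here is a self-contained sketch. It is not the PR's conversion code: `IndexOffsetView`, `RawPointerView`, `ToRawPointer`, and `ToIndexOffset` are hypothetical names, and character buffers are modeled as a `std::vector<std::string>`. Going from index/offset to a raw pointer is a single addition, while the reverse requires finding the buffer that contains the pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical view elements (illustrative only).
struct IndexOffsetView {
  int32_t size;
  int32_t buffer_index;
  int32_t offset;
};

struct RawPointerView {
  int32_t size;
  const char* data;  // only meaningful while the owning buffer is alive
};

// Index/offset -> raw pointer: look up the buffer and add the offset.
RawPointerView ToRawPointer(const IndexOffsetView& v,
                            const std::vector<std::string>& buffers) {
  assert(v.buffer_index >= 0 &&
         static_cast<std::size_t>(v.buffer_index) < buffers.size());
  const std::string& buf = buffers[static_cast<std::size_t>(v.buffer_index)];
  return RawPointerView{v.size, buf.data() + v.offset};
}

// Raw pointer -> index/offset: search for the buffer containing the bytes.
// Returns std::nullopt if the pointer does not fall inside any known buffer.
std::optional<IndexOffsetView> ToIndexOffset(
    const RawPointerView& v, const std::vector<std::string>& buffers) {
  // std::less_equal gives a total order even for pointers into
  // unrelated allocations, where the built-in operator does not.
  std::less_equal<const char*> le;
  for (std::size_t i = 0; i < buffers.size(); ++i) {
    const char* begin = buffers[i].data();
    const char* end = begin + buffers[i].size();
    if (le(begin, v.data) && le(v.data, end) && (end - v.data) >= v.size) {
      return IndexOffsetView{v.size, static_cast<int32_t>(i),
                             static_cast<int32_t>(v.data - begin)};
    }
  }
  return std::nullopt;
}
```

The linear search in `ToIndexOffset` is a simplification chosen for brevity; a real conversion could sort buffer address ranges first or copy unowned data into a new buffer.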