bkietz opened a new pull request, #37526:
URL: https://github.com/apache/arrow/pull/37526
String view (and the equivalent non-UTF-8 binary view) is an alternative
representation for variable-length strings which offers greater efficiency
for several common operations.
This representation is in use by UmbraDB, DuckDB, and Velox. Where those
databases use a raw pointer to out-of-line strings, this PR uses a pair of
32-bit integers as a buffer index and offset, which
- makes explicit the guarantee that the lifetime of all character data is
  equal to that of the array which views it, which is critical for confident
  consumption across an interface boundary
- makes the arrays meaningfully serializable and venue-agnostic: they are
  directly usable in shared memory without modification
- allows easy validation
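
To illustrate the points above, here is a minimal sketch of how a buffer
index and offset can be resolved and validated without touching any raw
pointer. It is a simplified model, not the memory layout defined by this PR:
the names View, validate, and resolve are hypothetical, and the length field
is assumed to be stored alongside the index and offset.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class View:
    """Hypothetical out-of-line view; not the exact Arrow memory layout."""
    length: int        # number of bytes in the string
    buffer_index: int  # which character data buffer holds the bytes
    offset: int        # byte position of the string within that buffer


def validate(views: Sequence[View], buffers: Sequence[bytes]) -> bool:
    """Return True if every view refers to bytes inside an existing buffer."""
    for v in views:
        if v.length < 0 or v.offset < 0:
            return False
        if not 0 <= v.buffer_index < len(buffers):
            return False
        if v.offset + v.length > len(buffers[v.buffer_index]):
            return False
    return True


def resolve(view: View, buffers: Sequence[bytes]) -> bytes:
    """Return the bytes a view refers to; call validate() first."""
    start = view.offset
    return buffers[view.buffer_index][start:start + view.length]


if __name__ == "__main__":
    buffers = [b"hello world", b"string view"]
    views = [View(length=5, buffer_index=0, offset=0),
             View(length=4, buffer_index=1, offset=7)]
    if validate(views, buffers):
        for v in views:
            print(resolve(v, buffers))
```

Because the views hold only integers, the same values stay meaningful after
the buffers are copied, serialized, or mapped at a different address, and a
bounds check needs nothing beyond the buffer lengths.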
This PR is extracted from https://github.com/apache/arrow/pull/35628 to
unblock independent PRs now that the vote has passed. It includes:
- New types added to Schema.fbs
- Message.fbs amended to support variable buffer counts between string view
  chunks
- datagen.py extended to produce integration JSON for string view arrays
- Columnar.rst amended with a description of the string view format