alamb commented on issue #7973: URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3492066161
> I get how `LargeUtf8` would solve the issue, but could you briefly explain how `StringView` works and how it's able to solve the 32-bit overflow issue? I'm familiar with the concept of string views in C++, but I don't quite get what it would do in Arrow.

I think this blog does a pretty good job of explaining what is going on: https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1 (basically, each string is stored as a pointer / offset into one of multiple buffers)

> As arrow-rs does not have the concept of "ChunkedArray" or "Table", we can only ever read as a single RecordBatch here.

I think `Vec<RecordBatch>` is the equivalent -- you certainly don't have to read the entire dataset into a single `RecordBatch`

> Would a single StringView RecordBatch be able to make more than 2**32 bytes of string data available? Wouldn't the string view still need a pointer larger than 32-bit to access all data?

Yes -- it can do more than 2^32 bytes

The way it does so is that it has multiple "buffers" (which can each be 2GB) and then stores the pointer as a buffer index (i32) and buffer offset (i32)

You can read more here - https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
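To make the buffer index + offset idea concrete, here is a minimal toy sketch using only the Rust standard library. It is not the arrow-rs API: `View`, `ToyViewArray`, `push`, `get`, `total_bytes` and the tiny `max_buffer_len` are hypothetical names made up for illustration, and strings of 12 bytes or fewer (which the format inlines directly in the view) are not modelled.

```rust
// Toy model of the "view" layout for long strings. NOT the arrow-rs API:
// every type and method name here is hypothetical.

/// One 16-byte view describing a string longer than 12 bytes.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct View {
    len: i32,          // string length in bytes
    prefix: [u8; 4],   // first 4 bytes of the string
    buffer_index: i32, // which data buffer holds the bytes
    offset: i32,       // where the string starts inside that buffer
}

struct ToyViewArray {
    views: Vec<View>,
    buffers: Vec<Vec<u8>>,
    // Assumption for the demo: a tiny cap so a new buffer is needed quickly.
    // With i32 offsets, a real buffer could hold roughly 2GB.
    max_buffer_len: usize,
}

impl ToyViewArray {
    fn new(max_buffer_len: usize) -> Self {
        Self { views: Vec::new(), buffers: vec![Vec::new()], max_buffer_len }
    }

    fn push(&mut self, s: &str) {
        let bytes = s.as_bytes();
        assert!(bytes.len() > 12, "short (inlined) strings are not modelled");
        assert!(bytes.len() <= self.max_buffer_len);
        // Start a new buffer when the current one cannot hold the string.
        if self.buffers.last().unwrap().len() + bytes.len() > self.max_buffer_len {
            self.buffers.push(Vec::new());
        }
        let buffer_index = self.buffers.len() - 1;
        let buf = &mut self.buffers[buffer_index];
        let offset = buf.len();
        buf.extend_from_slice(bytes);
        self.views.push(View {
            len: bytes.len() as i32,
            prefix: [bytes[0], bytes[1], bytes[2], bytes[3]],
            buffer_index: buffer_index as i32,
            offset: offset as i32,
        });
    }

    /// Resolve a view: pick the buffer, then slice at the offset.
    fn get(&self, i: usize) -> &str {
        let v = self.views[i];
        let start = v.offset as usize;
        let end = start + v.len as usize;
        std::str::from_utf8(&self.buffers[v.buffer_index as usize][start..end]).unwrap()
    }

    /// Total string bytes reachable from this one array, across all buffers.
    fn total_bytes(&self) -> usize {
        self.buffers.iter().map(Vec::len).sum()
    }
}
```

With `ToyViewArray::new(32)`, pushing two 23-byte strings puts the second one in a new buffer, so the array addresses 46 bytes even though no single offset exceeds the 32-byte cap. The same reasoning, scaled up to 32-bit offsets and many buffers, is why one view array can reference more than 2^32 bytes.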
