Re: [I] Error when reading row group larger than 2GB (total string length per 8k row batch exceeds 2GB) [arrow-rs]

via GitHub Wed, 05 Nov 2025 08:05:30 -0800


alamb commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3492066161


   > I get how `LargeUtf8` would solve the issue, but could you briefly explain 
how `StringView` works and how it's able to solve the 32-bit overflow issue? 
I'm familiar with the concept of string views in C++, but I don't quite get 
what it would do in Arrow.
   
   I think this blog does a pretty good job explaining what is going on: 
https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1
 (basically strings are stored as a pointer / offset with multiple buffers)
   
   > As arrow-rs does not have the concept of "ChunkedArray" or "Table", we can 
only ever read as a single RecordBatch here. 
   
   I think `Vec<RecordBatch>` is the equivalent -- you certainly don't have to 
read the entire dataset into a single `RecordBatch`
   
   > Would a single StringView RecordBatch be able to make more than 2**32 
bytes of string data available? Wouldn't the string view still need a pointer 
larger than 32-bit to access all data?
   
   Yes -- it can do more than 2^32 bytes
   
   It does so by having multiple data "buffers" (each of which can be up to 
2GB); each string's view then stores its location as a buffer index (i32) plus 
an offset within that buffer (i32)
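
   To make the buffer index / offset idea concrete, here is a minimal std-only sketch. It is a simplified model, not the arrow-rs API: `View`, `ViewColumn`, `push`, `get` and `max_buffer_len` are hypothetical names made up for illustration, and the inline storage Arrow uses for strings of 12 bytes or fewer is left out. The cap is kept tiny so the rollover to a second buffer is easy to see; in Arrow the per-buffer limit is what keeps each offset within 32 bits.

```rust
/// Simplified model of one 16-byte view (four 4-byte fields).
/// Real Arrow views store strings of <= 12 bytes inline; not modeled here.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct View {
    len: u32,          // string length in bytes
    prefix: [u8; 4],   // first bytes of the string, for fast comparisons
    buffer_index: u32, // which data buffer holds the bytes
    offset: u32,       // byte offset inside that buffer
}

/// Hypothetical container: several data buffers plus one view per string.
struct ViewColumn {
    buffers: Vec<Vec<u8>>,
    views: Vec<View>,
    max_buffer_len: usize, // assumed per-buffer cap (tiny here for the demo)
}

impl ViewColumn {
    fn new(max_buffer_len: usize) -> Self {
        ViewColumn { buffers: Vec::new(), views: Vec::new(), max_buffer_len }
    }

    /// Append a string, starting a new buffer when the current one would
    /// grow past the cap. No single offset ever has to exceed the cap.
    fn push(&mut self, s: &str) {
        let bytes = s.as_bytes();
        let needs_new = match self.buffers.last() {
            Some(b) => b.len() + bytes.len() > self.max_buffer_len,
            None => true,
        };
        if needs_new {
            self.buffers.push(Vec::new());
        }
        let buffer_index = self.buffers.len() - 1;
        let buf = &mut self.buffers[buffer_index];
        let offset = buf.len();
        buf.extend_from_slice(bytes);

        let mut prefix = [0u8; 4];
        let n = bytes.len().min(4);
        prefix[..n].copy_from_slice(&bytes[..n]);

        self.views.push(View {
            len: bytes.len() as u32,
            prefix,
            buffer_index: buffer_index as u32,
            offset: offset as u32,
        });
    }

    /// Resolve a view: pick the buffer, then slice at the offset.
    fn get(&self, i: usize) -> &str {
        let v = self.views[i];
        let buf = &self.buffers[v.buffer_index as usize];
        let start = v.offset as usize;
        std::str::from_utf8(&buf[start..start + v.len as usize]).unwrap()
    }
}
```

   With a cap of 32 bytes, pushing two 18-byte strings puts the second one in buffer 1 at offset 0. The total addressable data grows with the number of buffers, while every individual offset stays small.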
   
   You can read more here:
   - https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
   
   


