jonded94 commented on issue #7973: URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3492011640
> > (using our own PyO3 wrapper layer around arrow-rs here)
>
> Can you point me at that?

Sadly no, since that is a company-internal library 😅 But it's in principle very similar to `arro3` (https://github.com/kylebarron/arro3) or even directly inspired by it in parts.

> I think the workaround is to read the data as a LargeStringArray or StringView (DataType::LargeUtf8 or DataType::StringView)
> [...]
> Maybe we should change the default type read by arrow-rs for Strings to StringView 🤔

I get how `LargeUtf8` would solve the issue, but could you briefly explain how `StringView` works and how it's able to solve the 32-bit overflow issue? I'm familiar with the concept of string views in C++, but I don't quite get what it would do in Arrow.

Slightly tangential: In this particular issue, I think `pyarrow.ParquetFile(...).read_row_group(0)` wouldn't crash because it's returning a `Table` which internally is representing the columns as `ChunkedArray` instead of `Array`, i.e. it's circumventing the issue that a single `StringArray` has to point to more than `2**32` bytes of String data by just splitting the arrays up. As `arrow-rs` does not have the concept of "ChunkedArray" or "Table", we can only ever read as a single RecordBatch here. Would a single `StringView` RecordBatch be able to make more than `2**32` bytes of string data available? Wouldn't the string view still need a pointer larger than 32-bit to access all data?

Also another stupid question, sorry: How/where can you set with which type a specific column shall be read in `arrow-rs`? We have some schema upcasting logic in our internal library, but we apply that *after* the data was read from `arrow-rs`, so it actually didn't help us here since we already error out from stuff happening in `arrow-rs` first.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
