pitrou opened a new pull request, #46678: URL: https://github.com/apache/arrow/pull/46678
### Rationale for this change

When reading a Parquet LIST logical type (or a repeated field without a logical type), Parquet C++ automatically reads it as an Arrow List array. However, this can in some cases run into the 32-bit offsets limit. We'd like to be able to choose to read as LargeList instead, even if there is no serialized Arrow schema in the Parquet file.

### What changes are included in this PR?

* add a Parquet read option `list_type` to select which Arrow type to read LIST / repeated Parquet columns into
* fix an index truncation bug when writing a huge single chunk of data to Parquet

### Are these changes tested?

Yes, the functionality is tested. However, I wasn't able to write a unit test that wouldn't consume a horrendous amount of memory writing/reading a list with offsets larger than 2**32.

### Are there any user-facing changes?

No, only an API improvement.
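For readers unfamiliar with the 32-bit offsets limit mentioned in the rationale, here is a minimal Python sketch of the arithmetic involved. It is not the Parquet C++ code touched by this PR: the helper names (`offsets_for`, `pick_list_layout`, `truncate_to_int32`) are hypothetical and exist only for illustration. The sketch assumes the usual Arrow layout, where a List array stores signed 32-bit offsets into its child values and a LargeList stores signed 64-bit offsets. The last function shows the general kind of wraparound that happens when a 64-bit index is stored in a 32-bit variable; it does not claim to reproduce the specific bug fixed here.

```python
import ctypes

# Largest value a signed 32-bit offset can hold (assumed List offset width).
INT32_MAX = 2**31 - 1


def offsets_for(list_lengths):
    """Return cumulative offsets (n + 1 entries) for lists of the given lengths."""
    offsets = [0]
    for length in list_lengths:
        offsets.append(offsets[-1] + length)
    return offsets


def pick_list_layout(list_lengths):
    """Hypothetical helper: name the narrowest layout whose offsets still fit."""
    last_offset = offsets_for(list_lengths)[-1]
    return "list" if last_offset <= INT32_MAX else "large_list"


def truncate_to_int32(index):
    """Mimic storing a 64-bit index in a signed 32-bit variable (no range check)."""
    return ctypes.c_int32(index).value


if __name__ == "__main__":
    # Three small lists: [a, b], [], [c, d, e] share one child array of 5 values.
    print(offsets_for([2, 0, 3]))
    # Two lists of 2**30 values each push the final offset past INT32_MAX.
    print(pick_list_layout([2**30, 2**30]))
    # A 64-bit index silently wraps when narrowed to 32 bits.
    print(truncate_to_int32(2**32 + 5))
```

The point of the sketch is that the limit applies to the total number of child values in a single array, not to the length of any individual list, which is why a single huge chunk can exceed it even when each list is small.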
