[PR] GH-46676: [C++][Python][Parquet] Allow reading Parquet LIST data as LargeList directly [arrow]

via GitHub Mon, 02 Jun 2025 09:06:05 -0700


pitrou opened a new pull request, #46678:
URL: https://github.com/apache/arrow/pull/46678


   ### Rationale for this change
   
   When reading a Parquet LIST logical type (or a repeated field without a 
logical type), Parquet C++ automatically reads it as an Arrow List array.
   
   However, this can in some cases run into the 32-bit offsets limit. We'd like 
to be able to choose to read as LargeList instead, even if there is no 
serialized Arrow schema in the Parquet file.
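
   To make the limit concrete: an Arrow List array stores int32 offsets into 
its child values, while LargeList stores int64 offsets, so the total number of 
child values in a single List array cannot exceed 2**31 - 1. The following is a 
minimal sketch of that check in plain Python. The helper name 
`needs_large_list` is hypothetical and is not part of Arrow or of this PR.

```python
# Largest offset representable with the int32 offsets of an Arrow List array.
INT32_MAX = 2**31 - 1


def needs_large_list(list_lengths):
    """Return True if the final offset would not fit in int32.

    `list_lengths` is an iterable with the number of child values in each
    list element. The final offset of a List array equals their sum.
    (Hypothetical helper, for illustration only.)
    """
    return sum(list_lengths) > INT32_MAX


# Two list elements whose child values total exactly 2**31 - 1: still fits.
print(needs_large_list([2**30, 2**30 - 1]))  # False

# One more child value pushes the final offset past the int32 range.
print(needs_large_list([2**30, 2**30]))  # True
```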
   
   ### What changes are included in this PR?
   
   * add a Parquet read option `list_type` to select which Arrow type to read 
LIST / repeated Parquet columns into
   * fix an index truncation bug when writing a huge single chunk of data to 
Parquet
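
   As a generic illustration of what index truncation means (this is not the 
actual code path fixed in this PR, which is not shown here), the sketch below 
narrows a 64-bit index to 32 bits using `ctypes`. The helper name 
`truncate_to_int32` is hypothetical.

```python
import ctypes


def truncate_to_int32(index):
    """Mimic a narrowing conversion from a 64-bit index to int32.

    ctypes integer types do no overflow checking, so only the low 32 bits
    are kept, interpreted as a signed value. (Hypothetical helper.)
    """
    return ctypes.c_int32(index).value


# Small indices survive the conversion unchanged.
print(truncate_to_int32(1000))       # 1000

# Indices at or beyond 2**31 wrap around silently.
print(truncate_to_int32(2**31))      # -2147483648
print(truncate_to_int32(2**32 + 5))  # 5
```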
   
   ### Are these changes tested?
   
   Yes, the functionality is tested. However, I wasn't able to write a unit 
test that wouldn't consume a horrendous amount of memory writing/reading a list 
with offsets larger than 2**32.
   
   ### Are there any user-facing changes?
   
   No, only an API improvement.
   

