etseidl opened a new issue, #8713:
URL: https://github.com/apache/arrow-rs/issues/8713

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   Related to #5855 and #5853.
   
   One of the pain points of reading Parquet files is the all-or-nothing nature 
of the file metadata, which is stored as a Thrift-encoded blob in the file's 
footer. A traditional parser built from Thrift-generated code decodes the 
entire `FileMetaData` structure, which can be very costly for extremely large 
schemas. The new parsing code introduced recently (#5854) reduces this cost 
somewhat by skipping unwanted structures, but as currently implemented it must 
still process the Thrift framing of the structures it skips.
   
   **Describe the solution you'd like**
   One solution to the above is to provide an index into the serialized 
metadata so that only the structures requested are parsed. A full 
implementation of this would be used along with either row group selections or 
column projections, and would also be of use for predicate processing (only 
read column chunk statistics for columns present in the predicate, for 
instance). This will also need the options object detailed in #8643.
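
   As a rough illustration of the idea only, the sketch below models such an 
index as a flat table of byte ranges, one per column chunk, pointing into the 
serialized footer. The `MetadataIndex` type, its layout, and the `ranges_for` 
method are hypothetical names made up for this example; they are not part of 
arrow-rs or the Parquet specification, and a real design would also need to 
cover the schema, row group headers, and statistics.

```rust
use std::ops::Range;

/// Hypothetical index over a serialized `FileMetaData` blob.
/// Each entry records where one column chunk's metadata sits in the footer bytes.
#[derive(Debug, Clone)]
pub struct MetadataIndex {
    num_columns: usize,
    /// Byte ranges in row-major order: entry `row_group * num_columns + column`.
    column_chunk_ranges: Vec<Range<usize>>,
}

impl MetadataIndex {
    /// Builds an index, rejecting tables that do not hold a whole number of row groups.
    pub fn new(num_columns: usize, column_chunk_ranges: Vec<Range<usize>>) -> Option<Self> {
        if num_columns == 0 || column_chunk_ranges.len() % num_columns != 0 {
            return None;
        }
        Some(Self { num_columns, column_chunk_ranges })
    }

    /// Number of row groups covered by the index.
    pub fn num_row_groups(&self) -> usize {
        self.column_chunk_ranges.len() / self.num_columns
    }

    /// Returns the footer byte ranges that would need decoding for the given
    /// row group selection and column projection. Out-of-range indices are skipped.
    pub fn ranges_for(&self, row_groups: &[usize], columns: &[usize]) -> Vec<Range<usize>> {
        let mut out = Vec::new();
        for &rg in row_groups {
            if rg >= self.num_row_groups() {
                continue;
            }
            for &col in columns {
                if col >= self.num_columns {
                    continue;
                }
                out.push(self.column_chunk_ranges[rg * self.num_columns + col].clone());
            }
        }
        out
    }
}
```

   With an index like this, a reader holding a row group selection and a 
column projection could hand only the returned ranges to the Thrift decoder 
and never touch the framing of the remaining column chunks.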
   
   Such an index could be embedded in the `FileMetaData` in the manner 
described in the 
[Binary Protocol Extensions](https://github.com/apache/parquet-format/blob/master/BinaryProtocolExtensions.md) 
section of the Parquet specification.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
