etseidl commented on issue #5854: URL: https://github.com/apache/arrow-rs/issues/5854#issuecomment-3151859881
Bumping this rather than creating a new issue. Also rolling in #7909 and #6129. Here's what I'm planning:

1. Add more thrift processing benchmarks.
2. Reduce use of `parquet::format` as much as possible, especially in publicly exposed data structures like `FileMetaData`.
3. Create a custom thrift parser that decodes directly into the structures created in step 2. Part of this task will address #7909 by correctly handling unknown union values for `LogicalType` and `ColumnOrder` (see the first sketch below). This step will also leverage the macros developed by @jhorstmann (https://github.com/jhorstmann/compact-thrift).
4. Use the parser from step 3 internally to read non-exposed structures such as the page headers.
5. Add the ability to write the new structures directly to thrift-encoded bytes (see the second sketch below).
6. Remove the `format` module.
7. Explore opportunities for further speed-ups, such as skipping row groups and projected columns, not decoding page statistics, and halting processing after reading the schema.

Hopefully I can have all of the above ready in time for 57.0.0 😅
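To illustrate the unknown-union handling mentioned in step 3: a minimal sketch of what tolerating unrecognized `LogicalType` union members could look like. The enum and function here are hypothetical simplifications, not the actual parquet-rs types; the idea is simply to map unknown field ids to a catch-all variant instead of returning an error, so metadata written by newer Parquet versions still round-trips.

```rust
/// Hypothetical, simplified LogicalType that tolerates union variants
/// added by future Parquet format versions instead of failing to parse.
#[derive(Debug, Clone, PartialEq)]
pub enum LogicalType {
    String,
    Map,
    List,
    // ... remaining known variants elided for brevity ...
    /// Field id of a union member this reader does not recognize.
    /// Preserving the id lets the metadata round-trip without loss.
    Unknown { field_id: i16 },
}

fn read_logical_type(field_id: i16) -> LogicalType {
    match field_id {
        1 => LogicalType::String,
        2 => LogicalType::Map,
        3 => LogicalType::List,
        // Ids not known to this build map to Unknown rather than Err.
        other => LogicalType::Unknown { field_id: other },
    }
}

fn main() {
    // A field id from a future Parquet version is preserved, not rejected.
    assert_eq!(read_logical_type(99), LogicalType::Unknown { field_id: 99 });
}
```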
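And for step 5, a minimal sketch of the low-level primitives a direct thrift writer needs: zigzag encoding plus ULEB128 varints, which is how the Thrift compact protocol serializes integers. Function names are placeholders, not existing parquet-rs APIs.

```rust
/// Zigzag-encode a signed integer so small magnitudes (positive or
/// negative) produce short varints, per the Thrift compact protocol.
fn zigzag_encode(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Write an unsigned integer as a ULEB128 varint (7 bits per byte,
/// high bit set on all but the last byte).
fn write_varint(mut v: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            break;
        }
        out.push(byte | 0x80);
    }
}

fn write_i64(v: i64, out: &mut Vec<u8>) {
    write_varint(zigzag_encode(v), out);
}

fn main() {
    let mut buf = Vec::new();
    write_i64(-1, &mut buf); // zigzag(-1) = 1, which encodes as a single 0x01 byte
    assert_eq!(buf, vec![0x01]);
}
```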
