alamb commented on issue #5854: URL: https://github.com/apache/arrow-rs/issues/5854#issuecomment-3154746369
> Bumping this rather than creating a new issue. Also rolling in [#7909](https://github.com/apache/arrow-rs/issues/7909) and [#6129](https://github.com/apache/arrow-rs/issues/6129).
>
> Here's what I'm planning:
>
> 1. Add more thrift processing benchmarks.
> 2. Reduce use of `parquet::format` as much as possible, especially in publicly exposed data structures like `FileMetaData`.

This will be awesome -- I have likewise found the `format` module very hard to work with.

> 3. Create a custom thrift parser to decode directly to the structures created in step 2. Part of this task will address [[Parquet] reader appears to have bug when supporting unknown sort orders #7909](https://github.com/apache/arrow-rs/issues/7909) by correctly dealing with unknown union values for `LogicalType` and `ColumnOrder`. This step will also leverage the macros developed by [@jhorstmann](https://github.com/jhorstmann) (https://github.com/jhorstmann/compact-thrift).
> 4. Use the parser from step 3 internally to read non-exposed structures such as the page headers.

This will be awesome -- when I was working on the parquet push decoder I also found that the existing thrift decoder copies data that is already in memory *AGAIN* using the `std::io::Read` trait -- it would be awesome if the new decoder took data in an in-memory `Buffer`. (A few illustrative sketches of the unknown-union handling, zero-copy reading, and direct writing are at the end of this comment.)

Not sure what your plans are, but I would like to recommend a "push" decoder that avoids having to do IO at all, similar to what I have in https://github.com/apache/arrow-rs/pull/7997. A taste from `parquet/src/file/metadata/push_decoder.rs`:

```rust
/// The `ParquetMetaDataPushDecoder` needs to know the file length.
/// let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len).unwrap();
/// // Try to decode the metadata. If more data is needed, the decoder will tell you what ranges it needs.
/// loop {
///     match decoder.try_decode() {
///         Ok(DecodeResult::Data(metadata)) => { return Ok(metadata); } // decode successful
///         Ok(DecodeResult::NeedsData(ranges)) => {
///             // The decoder needs more data
///             //
///             // In this example, we call a function that returns the bytes for each given range.
///             // In a real application, you would likely read the data from a file or network.
///             let data = ranges.iter().map(|range| get_range(range)).collect();
///             // Push the data into the decoder and try to decode again on the next iteration.
///             decoder.push_ranges(ranges, data).unwrap();
///         }
///         Ok(DecodeResult::Finished) => { unreachable!("returned metadata in previous match arm") }
///         Err(e) => return Err(e),
///     }
/// }
```

> 5. Add the ability to write the new structures directly to thrift-encoded bytes.
> 6. Remove the `format` module.
> 7. Explore opportunities for further speedups. Examples include skipping row groups and projected columns, not decoding page statistics, and halting processing after reading the schema.
>
> Hopefully I can have all of the above ready in time for 57.0.0 😅

Sounds ambitious! However, I am around to review things -- please feel free to ping me and I'll try my best to get them reviewed.
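Since a couple of the steps above are easier to discuss with code, here are some purely illustrative sketches (hand-written for this comment, not the actual arrow-rs APIs or @jhorstmann's macros). First, the unknown-union handling from step 3: a hand-rolled decoder can keep an explicit `Unknown` variant instead of erroring when a newer writer uses a field id this reader has never seen, which is the behavior #7909 needs for `ColumnOrder`:

```rust
// Illustrative only: model unknown thrift union variants explicitly.
#[derive(Debug, Clone, PartialEq)]
enum ColumnOrder {
    /// Field id 1 in the parquet thrift IDL (`TypeDefinedOrder`)
    TypeDefinedOrder,
    /// Any field id this version of the reader does not know about
    Unknown(i16),
}

fn column_order_from_field_id(field_id: i16) -> ColumnOrder {
    match field_id {
        1 => ColumnOrder::TypeDefinedOrder,
        // An older reader seeing a newer sort order must not fail; it keeps
        // the raw id so callers can ignore min/max statistics for the column.
        other => ColumnOrder::Unknown(other),
    }
}
```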
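The zero-copy point is similar: a reader that works directly on a slice of bytes already in memory can parse varints in place and *borrow* binary fields out of the buffer instead of copying them through `std::io::Read`. A minimal sketch, with made-up names and simplified error handling:

```rust
// Illustrative only: a compact-protocol style reader over an in-memory slice.
struct SliceReader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> SliceReader<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Read a ULEB128 varint, the basic building block of the compact protocol.
    fn read_varint(&mut self) -> Result<u64, String> {
        let mut value = 0u64;
        let mut shift = 0u32;
        loop {
            let byte = *self.buf.get(self.pos).ok_or("unexpected EOF")?;
            self.pos += 1;
            value |= u64::from(byte & 0x7F) << shift;
            if byte & 0x80 == 0 {
                return Ok(value);
            }
            shift += 7;
            if shift >= 64 {
                return Err("varint too long".to_string());
            }
        }
    }

    /// Binary fields are borrowed from the underlying buffer, not copied.
    fn read_bytes(&mut self) -> Result<&'a [u8], String> {
        let len = self.read_varint()? as usize;
        let end = self.pos.checked_add(len).ok_or("length overflow")?;
        let bytes = self.buf.get(self.pos..end).ok_or("unexpected EOF")?;
        self.pos = end;
        Ok(bytes)
    }
}
```

Because `read_bytes` returns a slice tied to the input's lifetime, decoding metadata from bytes that are already resident in memory costs no extra copies.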
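And the write side (step 5) is largely the same primitives in reverse: the compact protocol bottoms out in varints plus zigzag encoding for signed integers, so structures can be serialized straight to bytes without building `parquet::format` values first. Again purely illustrative:

```rust
// Illustrative only: the two primitives needed to emit thrift compact
// protocol directly into a byte vector.
fn write_varint(out: &mut Vec<u8>, mut v: u64) {
    loop {
        let byte = (v & 0x7F) as u8;
        v >>= 7;
        if v == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// The compact protocol stores i32 as a zigzag-encoded varint so that
/// small negative numbers stay small on the wire.
fn write_zigzag_i32(out: &mut Vec<u8>, v: i32) {
    let zigzag = ((v as u32) << 1) ^ ((v >> 31) as u32);
    write_varint(out, u64::from(zigzag));
}
```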
