Hello, I did some benchmarking for the new parser[2] we are working on in arrow-rs.
This benchmark achieves nearly an order of magnitude improvement (7x) parsing Parquet metadata with no changes to the Parquet format, simply by writing a more efficient thrift decoder (which can also skip statistics).

While we have not implemented a similar decoder in other languages such as C/C++ or Java, given the similarities in the existing thrift libraries and usage, we expect similar improvements are possible in those languages as well.

Here are some inline images:

[image: image.png]
[image: image.png]

You can find full details here [1].

Andrew

[1]: https://github.com/alamb/parquet_footer_parsing
[2]: https://github.com/apache/arrow-rs/issues/5854

On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:

> > Concerning Thrift optimization, while a 2-3x improvement might be
> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
> > Andrew, do you have a more precise estimate for the speedup we could
> > expect in C++?
>
> Given my past experience on cuDF, I'd estimate about 2X there as well.
> cuDF has its own metadata parser that I once benchmarked against the
> thrift generated parser.
>
> And I'd point out that beyond the initial 2X improvement, rolling your own
> parser frees you of having to parse out every structure in the metadata.
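
To make the "skip what you don't need" point concrete, here is a minimal sketch of how a hand-written decoder for the Thrift compact protocol can step over fields such as statistics without materializing them. This is not the arrow-rs implementation: the names `Cursor`, `skip_value`, and `read_i64_field` are hypothetical, maps are left out for brevity, and there is no recursion limit, so treat it as an illustration only.

```rust
/// Hypothetical cursor over a Thrift compact-protocol buffer.
struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    fn byte(&mut self) -> Option<u8> {
        let b = *self.buf.get(self.pos)?;
        self.pos += 1;
        Some(b)
    }

    /// Reads an unsigned LEB128 varint.
    fn varint(&mut self) -> Option<u64> {
        let mut v = 0u64;
        let mut shift = 0u32;
        loop {
            let b = self.byte()?;
            v |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Some(v);
            }
            shift += 7;
            if shift > 63 {
                return None; // malformed: varint too long
            }
        }
    }

    fn advance(&mut self, n: usize) -> Option<()> {
        let end = self.pos.checked_add(n)?;
        if end > self.buf.len() {
            return None; // truncated input
        }
        self.pos = end;
        Some(())
    }

    /// Skips one struct field value of compact-protocol type `ty`.
    fn skip_value(&mut self, ty: u8) -> Option<()> {
        match ty {
            1 | 2 => Some(()),                      // bool lives in the field header
            3 => self.advance(1),                   // i8
            4 | 5 | 6 => self.varint().map(|_| ()), // i16 / i32 / i64
            7 => self.advance(8),                   // double
            8 => {
                // binary / string: varint length, then raw bytes
                let n = usize::try_from(self.varint()?).ok()?;
                self.advance(n)
            }
            9 | 10 => {
                // list / set: size in the high nibble (15 means varint follows)
                let h = self.byte()?;
                let elem = h & 0x0f;
                let mut n = u64::from(h >> 4);
                if n == 15 {
                    n = self.varint()?;
                }
                for _ in 0..n {
                    self.skip_elem(elem)?;
                }
                Some(())
            }
            12 => self.skip_struct(),
            _ => None, // maps and unknown types are not handled in this sketch
        }
    }

    /// Inside collections a bool occupies one byte of its own.
    fn skip_elem(&mut self, ty: u8) -> Option<()> {
        if ty == 1 || ty == 2 {
            self.advance(1)
        } else {
            self.skip_value(ty)
        }
    }

    /// Skips fields until the stop byte; no recursion limit (sketch only).
    fn skip_struct(&mut self) -> Option<()> {
        loop {
            let h = self.byte()?;
            if h == 0 {
                return Some(());
            }
            if h >> 4 == 0 {
                self.varint()?; // long-form field id
            }
            self.skip_value(h & 0x0f)?;
        }
    }
}

fn unzigzag(v: u64) -> i64 {
    ((v >> 1) as i64) ^ -((v & 1) as i64)
}

/// Returns the i64 in field `wanted` of the top-level struct,
/// skipping every other field (including nested structs).
fn read_i64_field(buf: &[u8], wanted: i16) -> Option<i64> {
    let mut c = Cursor::new(buf);
    let mut id: i16 = 0;
    loop {
        let h = c.byte()?;
        if h == 0 {
            return None; // reached the stop byte without finding the field
        }
        let (delta, ty) = (h >> 4, h & 0x0f);
        id = if delta == 0 {
            unzigzag(c.varint()?) as i16
        } else {
            id.checked_add(i16::from(delta))?
        };
        if id == wanted && ty == 6 {
            return Some(unzigzag(c.varint()?));
        }
        c.skip_value(ty)?;
    }
}
```

A generated parser would typically decode the nested struct into an owned value before the caller could ignore it. Here the nested struct costs only a walk over its bytes, with no allocation, which is one plausible source of the gains beyond the baseline speedup.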
