Thank you, Andrew, for open-sourcing the code so that we can reproduce it.

We have run the Rust benchmarks and have also run the flatbuf proposal in four configurations: with our C++ Thrift parser, with the flatbuf footer plus Thrift conversion, with the flatbuf footer without Thrift conversion, and with the flatbuf footer without Thrift conversion and without verification.

You can find the summary of our findings in a separate tab in the proposal document:
https://docs.google.com/document/d/1kZS_DM_J8n6NKff3vDQPD1Y4xyDdRceYFANUE0bOfb0/edit?tab=t.ve65qknb3sq1#heading=h.3uwb5liauf1s

The TL;DR is that flatbuf with the Thrift conversion is 5x faster than the optimized Thrift parsing. It also remains faster than the Thrift parser even if the Thrift parser skips statistics. Furthermore, if Thrift conversion is skipped, the speedup is 50x, and if verification is also skipped, it goes beyond 150x.

On Tue, Sep 30, 2025 at 5:56 PM Andrew Lamb <[email protected]> wrote:

> Hello,
>
> I did some benchmarking for the new parser[2] we are working on in
> arrow-rs.
>
> This benchmark achieves nearly an order of magnitude improvement (7x)
> parsing Parquet metadata with no changes to the Parquet format, by simply
> writing a more efficient thrift decoder (which can also skip statistics).
>
> While we have not implemented a similar decoder in other languages such as
> C/C++ or Java, given the similarities in the existing thrift libraries and
> usage, we expect similar improvements are possible in those languages as
> well.
>
> Here are some inline images:
> [image: image.png]
> [image: image.png]
>
>
> You can find full details here [1]
>
> Andrew
>
>
> [1]: https://github.com/alamb/parquet_footer_parsing
> [2]: https://github.com/apache/arrow-rs/issues/5854
>
>
> On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:
>
>> > Concerning Thrift optimization, while a 2-3x improvement might be
>> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
>> > Andrew, do you have a more precise estimate for the speedup we could
>> > expect in C++?
>>
>> Given my past experience on cuDF, I'd estimate about 2X there as well.
>> cuDF has its own metadata parser that I once benchmarked against the
>> thrift generated parser.
>>
>> And I'd point out that beyond the initial 2X improvement, rolling your
>> own parser frees you of having to parse out every structure in the metadata.
>>
>
