Hello,

I did some benchmarking for the new parser [2] we are working on in
arrow-rs.

The benchmark shows a 7x speedup (nearly an order of magnitude) in
parsing Parquet metadata, with no changes to the Parquet format, simply by
writing a more efficient Thrift decoder (which can also skip statistics).
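
To illustrate why skipping is cheap, here is a minimal sketch of a
hand-written Thrift compact-protocol reader in Rust. It is not the code in
arrow-rs. The names (Cursor, read_sizes) and the struct layout (field 1 and
field 3 are i64 values, field 2 is a nested statistics-like struct) are
hypothetical and chosen only for the example. Map types and recursion
limits are left out.

```rust
/// Cursor over a Thrift compact-protocol buffer (hypothetical helper).
pub struct Cursor<'a> {
    buf: &'a [u8],
    pos: usize,
}

/// Decode a zigzag-encoded integer.
pub fn zigzag(v: u64) -> i64 {
    ((v >> 1) as i64) ^ -((v & 1) as i64)
}

impl<'a> Cursor<'a> {
    pub fn new(buf: &'a [u8]) -> Self {
        Cursor { buf, pos: 0 }
    }

    pub fn byte(&mut self) -> Result<u8, String> {
        let b = *self.buf.get(self.pos).ok_or("unexpected end of input")?;
        self.pos += 1;
        Ok(b)
    }

    /// Move past `n` bytes without looking at them.
    pub fn advance(&mut self, n: usize) -> Result<(), String> {
        let end = self
            .pos
            .checked_add(n)
            .filter(|&e| e <= self.buf.len())
            .ok_or("unexpected end of input")?;
        self.pos = end;
        Ok(())
    }

    /// Read an unsigned LEB128 varint.
    pub fn varint(&mut self) -> Result<u64, String> {
        let (mut value, mut shift) = (0u64, 0u32);
        loop {
            let b = self.byte()?;
            if shift >= 64 {
                return Err("varint too long".to_string());
            }
            value |= u64::from(b & 0x7f) << shift;
            if b & 0x80 == 0 {
                return Ok(value);
            }
            shift += 7;
        }
    }

    /// Skip one value of compact-protocol type `ty` without materializing it.
    pub fn skip(&mut self, ty: u8) -> Result<(), String> {
        match ty {
            1 | 2 => Ok(()),                        // bool lives in the field header
            3 => self.advance(1),                   // i8
            4 | 5 | 6 => self.varint().map(|_| ()), // i16 / i32 / i64
            7 => self.advance(8),                   // double
            8 => {
                let n = self.varint()? as usize;    // binary: length, then bytes
                self.advance(n)
            }
            9 | 10 => {
                let h = self.byte()?;               // list / set header
                let elem = h & 0x0f;
                let mut n = u64::from(h >> 4);
                if n == 15 {
                    n = self.varint()?;
                }
                for _ in 0..n {
                    // Inside a list, each bool occupies one byte.
                    if elem == 1 || elem == 2 { self.advance(1)? } else { self.skip(elem)? }
                }
                Ok(())
            }
            12 => self.skip_struct(),
            other => Err(format!("unsupported type {}", other)),
        }
    }

    /// Skip every field of a struct up to and including its stop byte.
    pub fn skip_struct(&mut self) -> Result<(), String> {
        loop {
            let h = self.byte()?;
            if h == 0 {
                return Ok(());
            }
            if h >> 4 == 0 {
                self.varint()?; // explicit field id follows the header
            }
            self.skip(h & 0x0f)?;
        }
    }
}

/// Read fields 1 and 3 (both i64) of the hypothetical struct; skip the rest.
pub fn read_sizes(buf: &[u8]) -> Result<(i64, i64), String> {
    let mut c = Cursor::new(buf);
    let (mut first, mut third, mut last_id) = (0i64, 0i64, 0i16);
    loop {
        let h = c.byte()?;
        if h == 0 {
            return Ok((first, third));
        }
        let (ty, delta) = (h & 0x0f, h >> 4);
        let id = if delta == 0 { zigzag(c.varint()?) as i16 } else { last_id.wrapping_add(i16::from(delta)) };
        last_id = id;
        match (id, ty) {
            (1, 6) => first = zigzag(c.varint()?),
            (3, 6) => third = zigzag(c.varint()?),
            _ => c.skip(ty)?, // e.g. the nested statistics-like struct
        }
    }
}
```

The point of the sketch is that a skipped string or binary field costs one
length read and a pointer bump, with no allocation or copy, whereas a
generated decoder typically materializes every field it encounters.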

While we have not implemented a similar decoder in other languages such as
C/C++ or Java, given the similarities in the existing thrift libraries and
usage, we expect similar improvements are possible in those languages as
well.

Here are some inline images:
[image: image.png]
[image: image.png]


You can find full details here [1].

Andrew


[1]: https://github.com/alamb/parquet_footer_parsing
[2]: https://github.com/apache/arrow-rs/issues/5854


On Wed, Sep 24, 2025 at 5:59 PM Ed Seidl <[email protected]> wrote:

> > Concerning Thrift optimization, while a 2-3x improvement might be
> > achievable, Flatbuffers are currently demonstrating a 10x improvement.
> > Andrew, do you have a more precise estimate for the speedup we could
> > expect in C++?
>
> Given my past experience on cuDF, I'd estimate about 2X there as well.
> cuDF has its own metadata parser that I once benchmarked against the
> thrift generated parser.
>
> And I'd point out that beyond the initial 2X improvement, rolling your own
> parser frees you of having to parse out every structure in the metadata.
>
