alamb commented on issue #8441: URL: https://github.com/apache/arrow-rs/issues/8441#issuecomment-3352631068
I have completed my initial benchmark testing (details at https://github.com/alamb/parquet_footer_parsing). The summary is below (nice work @etseidl and @jhorstmann on the thrift decoding work).

I'll post a version of this to the parquet mailing list later today.

## Summary

These benchmarks demonstrate a roughly 7x improvement (nearly an order of magnitude) in parsing Parquet metadata with **no changes to the Parquet format**, simply by writing a more efficient thrift decoder. While we have not implemented a similar decoder in other languages such as C/C++ or Java, given the similarities in the existing thrift libraries and their usage, we expect similar improvements are possible in those languages as well.

<img width="1080" height="681" alt="Image" src="https://github.com/user-attachments/assets/3029a506-9e35-4af3-ab8f-7cff8b3eeec4" />

**Figure 1**: Benchmark results for [Apache Parquet] metadata parsing using the new thrift decoder in [arrow-rs], scheduled for release in [57.0.0]. No changes are needed to the Parquet format itself.

<img width="1060" height="596" alt="Image" src="https://github.com/user-attachments/assets/695db9e5-18f0-4c96-8db9-a4a9770d708d" />

**Figure 2**: Speedup for Apache Parquet metadata parsing for varying data types and column counts.

[Apache Parquet]: https://parquet.apache.org/
[arrow-rs]: https://github.com/apache/arrow-rs
[57.0.0]: https://github.com/apache/arrow-rs/issues/7835

*Note 1: The "no stats" version is a modified version of the new thrift parser that skips over all index structures entirely, including statistics on column chunks as well as page and offset indexes.*

*Note 2: These results show the theoretical best-case improvements (e.g. when doing point lookups in Parquet files using an external index, as explained in the blog post [Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]). Most workloads will see more modest improvements.*

[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
[Apache DataFusion]: https://datafusion.apache.org/

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
