alamb commented on issue #8441:
URL: https://github.com/apache/arrow-rs/issues/8441#issuecomment-3352631068

   I have completed my initial benchmark testing (details at
   https://github.com/alamb/parquet_footer_parsing).
   
   The summary is below (nice work @etseidl and @jhorstmann on the thrift
   decoding work).
   
   I'll post a version of this to the parquet mailing list later today.
   
   
   ## Summary
   
   This benchmark demonstrates a roughly 7x improvement (nearly an order of
   magnitude) in parsing Parquet metadata with **no changes to the Parquet
   format**, by simply writing a more efficient thrift decoder.
   
   While we have not implemented a similar decoder in other languages such as
   C/C++ or Java, given the similarities in the existing thrift libraries and
   usage, we expect similar improvements are possible in those languages as well.
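   
   For readers unfamiliar with what a thrift decoder spends its time on: Parquet
   footer metadata is serialized with the Thrift compact protocol, in which
   integers are stored as zigzag-encoded variable-length integers. The sketch
   below shows only that one primitive. It is not the arrow-rs decoder, and the
   `SliceReader` type and its method names are hypothetical names chosen for
   illustration.

```rust
/// Minimal cursor over a byte slice.
/// Hypothetical type for illustration; not part of arrow-rs or the parquet crate.
struct SliceReader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> SliceReader<'a> {
    fn new(buf: &'a [u8]) -> Self {
        Self { buf, pos: 0 }
    }

    /// Read an unsigned varint: 7 payload bits per byte, least significant
    /// group first, with the high bit set on every byte except the last.
    /// Returns `None` on truncated input or if the value exceeds 10 bytes.
    fn read_uvarint(&mut self) -> Option<u64> {
        let mut value: u64 = 0;
        let mut shift: u32 = 0;
        loop {
            let byte = *self.buf.get(self.pos)?;
            self.pos += 1;
            if shift >= 64 {
                return None; // more continuation bytes than a u64 can hold
            }
            value |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                return Some(value);
            }
            shift += 7;
        }
    }

    /// Read a signed integer: a varint followed by zigzag decoding,
    /// which maps 0, 1, 2, 3, ... back to 0, -1, 1, -2, ...
    fn read_i64(&mut self) -> Option<i64> {
        let raw = self.read_uvarint()?;
        Some(((raw >> 1) as i64) ^ -((raw & 1) as i64))
    }
}
```

   Reading directly from a borrowed slice like this avoids a generic I/O layer
   on every byte. Whether that matches how the new arrow-rs decoder gets its
   speedup is an assumption here; see the benchmark repository linked above for
   the actual code and measurements.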
   
   <img width="1080" height="681" alt="Image" src="https://github.com/user-attachments/assets/3029a506-9e35-4af3-ab8f-7cff8b3eeec4" />
   
   **Figure 1**: Benchmark results for [Apache Parquet] metadata parsing using
   the [new thrift decoder] in [arrow-rs], scheduled for release in [57.0.0].
   No changes are needed to the Parquet format itself.
   
   <img width="1060" height="596" alt="Image" src="https://github.com/user-attachments/assets/695db9e5-18f0-4c96-8db9-a4a9770d708d" />
   
   **Figure 2**: Speedup for Apache Parquet metadata parsing for varying data
   types and column counts.
   
   [Apache Parquet]: https://parquet.apache.org/
   [arrow-rs]: https://github.com/apache/arrow-rs
   [57.0.0]: https://github.com/apache/arrow-rs/issues/7835
   
   
   *Note 1: The "no stats" version is a modified version of the new thrift
   parser that skips over all index structures entirely, including statistics
   on column chunks as well as page and offset indexes.*
   
   *Note 2: These results show the theoretical best-case improvements (e.g. when
   doing point lookups in Parquet files using an external index, as explained in
   the blog post [Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]).
   Most workloads will see more modest improvements.*
   
   [Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries on Apache Parquet]: https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
   [Apache DataFusion]: https://datafusion.apache.org/
   

