alamb commented on code in PR #711:
URL: https://github.com/apache/arrow-site/pull/711#discussion_r2432951974
##########
_posts/2025-10-30-rust-parquet-metadata.md:
##########
@@ -0,0 +1,392 @@
+---
+layout: post
+title: "Parsing Apache Parquet Footer Metadata Using a Custom Thrift Parser in
Rust"
+date: "2025-10-08 00:00:00"
+author: "Andrew Lamb (InfluxData)"
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+*Editor’s Note: While [Apache Arrow] and [Apache Parquet] are separate
+projects, this post is part of the Arrow site because the Arrow [arrow-rs]
+repository hosts the development of the [parquet] Rust crate, a widely used and
+high-performance implementation of the Parquet format.*
+
+## Summary
+
+Version `57.0.0` of the [parquet] Rust crate decodes metadata roughly twice as
+fast as previous versions thanks to a new custom [Apache Thrift] parser. The
+speedup holds in all cases shown below, and the new parser sets the stage for
+further performance improvements not possible with generated parsers, such as
+skipping unnecessary fields and selective parsing.
+
+<!-- AAL: TODO: update the benchmark and charts with results from 57.0.0 -->
+
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl }}/img/rust-parquet-metadata/results.png"
width="100%" class="img-responsive" alt="" aria-hidden="true">
+</div>
+
+*Figure 1:* Performance comparison of [Apache Parquet] metadata parsing using
a generated
+Thrift parser (versions `56.2.0` and earlier) and the new
+[custom Thrift decoder] in [arrow-rs] version [57.0.0]. No
+changes are needed to the Parquet format itself.
+See the [benchmark page] for more details.
+
+[parquet]: https://crates.io/crates/parquet
+[Apache Arrow]: https://arrow.apache.org/
+[Apache Parquet]: https://parquet.apache.org/
+[custom Thrift decoder]: https://github.com/apache/arrow-rs/issues/5854
+[arrow-rs]: https://github.com/apache/arrow-rs
+[57.0.0]: https://github.com/apache/arrow-rs/issues/7835
+
+[benchmark page]: https://github.com/alamb/parquet_footer_parsing
+
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl }}/img/rust-parquet-metadata/scaling.png"
width="100%" class="img-responsive" alt="Scaling behavior of custom Thrift
parser" aria-hidden="true">
+</div>
+
+*Figure 2:* Speedup of the [custom Thrift decoder] for string and
floating-point data types,
+for `100`, `1000`, and `100,000` columns. The new parser is faster in all
cases,
+and the speedup is similar regardless of the number of columns. See the
[benchmark page] for more details.
+
+## Introduction: Parquet and the Importance of Metadata Parsing
+
+[Apache Parquet] is a popular columnar storage format for big data processing.
It
+is designed to be efficient for both storage and query performance. Parquet
+files consist of a header, a series of data pages, and a footer, as shown in
Figure 3. The footer
+contains metadata about the file, including schema, statistics, and other
+information needed to decode the data.
+
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl }}/img/rust-parquet-metadata/parquet.png"
width="100%" class="img-responsive" alt="Physical File Structure of Parquet"
aria-hidden="true">
+</div>
+
+*Figure 3:* Structure of a Parquet file showing the header, data pages, and
footer metadata.
+
+Getting information stored in the footer is typically the first step in reading
+a Parquet file, as it is required to interpret the data pages. *Parsing* the
+footer is often on the critical path for reading data:
+
+* When reading from fast local storage, such as modern NVMe SSDs, footer
parsing
+ must be completed before data pages are read, placing it on the critical
+ I/O path.
+* Footer parsing scales linearly with the number of columns and row groups in a
+ Parquet file and thus can be a bottleneck for tables with many columns or
files
+ with many row groups.
+* For systems that cache the parsed footer in memory, as explained in [Using
+ External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries
+ on Apache Parquet], the footer must be parsed when the cache is cold or the
+ hit rate is low.
+
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl }}/img/rust-parquet-metadata/flow.png"
width="100%" class="img-responsive" alt="Typical Parquet processing flow"
aria-hidden="true">
+</div>
+
+*Figure 4:* Typical processing flow for Parquet files for stateless and
stateful systems.
+The performance of footer parsing is important for both types of systems, but
especially
+for stateless systems that do not cache the parsed footer.
+
+[Using External Indexes, Metadata Stores, Catalogs and Caches to Accelerate
Queries on Apache Parquet]:
https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/
+
+The speed of parsing metadata has grown even more important as Parquet spreads
+throughout the data ecosystem and is used for more latency-sensitive workloads
such
+as observability (TODO find citations), interactive analytics, and single-point
+lookups for Retrieval-Augmented Generation (RAG) applications feeding LLMs
(TODO
+find citations). As overall query times decrease, the relative
+importance of footer parsing increases.
+
+## Background: Apache Thrift
+
+Parquet stores metadata using [Apache Thrift], a framework for defining and
+serializing data types and service interfaces. It includes a [data definition
+language] similar to [Protocol Buffers]. Thrift definition files describe data
+types in a language-neutral way, and systems use code generators to
+automatically create code for a specific programming language to read and write
+those data types.
+
+The [parquet.thrift] file defines the format of Parquet metadata that is
+serialized at the end of each Parquet file in the [Thrift compact binary
+encoding format], as shown below in Figure 5. The binary encoding is
"variable-length",
+meaning that the length of each element depends on its content, not
+just its type. Small integer values are encoded in fewer bytes than
+large ones, and strings and lists are stored inline, prefixed with their
+length.
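To make the variable-length idea concrete, here is a minimal sketch of how the Thrift compact protocol encodes a signed integer: the value is first zigzag-mapped to an unsigned integer and then written seven bits at a time as a varint. The function names are illustrative assumptions and are not taken from the parquet crate.

```rust
/// Map a signed integer to an unsigned one so that values close to zero
/// (positive or negative) become small numbers: 0, -1, 1, -2 -> 0, 1, 2, 3.
fn zigzag_encode(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Inverse of `zigzag_encode`.
fn zigzag_decode(n: u64) -> i64 {
    ((n >> 1) as i64) ^ -((n & 1) as i64)
}

/// Append `n` as a varint: 7 payload bits per byte, with the high bit set
/// on every byte except the last.
fn write_varint(mut n: u64, out: &mut Vec<u8>) {
    while n >= 0x80 {
        out.push(((n & 0x7f) as u8) | 0x80);
        n >>= 7;
    }
    out.push(n as u8);
}

/// Read a varint from the start of `buf`, returning the value and the number
/// of bytes consumed, or `None` if the input is truncated or too long.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Encode a signed integer as a zigzag varint.
fn encode_i64(v: i64) -> Vec<u8> {
    let mut out = Vec::new();
    write_varint(zigzag_encode(v), &mut out);
    out
}

/// Decode a zigzag varint, returning the value and the bytes consumed.
fn decode_i64(buf: &[u8]) -> Option<(i64, usize)> {
    read_varint(buf).map(|(n, used)| (zigzag_decode(n), used))
}
```

With these helpers, `encode_i64(1)` produces a single byte while `encode_i64(300)` produces two, which is why the size of a field cannot be known from its type alone.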
+
+This encoding is space-efficient but, due to being variable-length, does not
+support random access: it is not possible to locate a particular field without
+scanning all previous fields. Other formats such as [Flatbuffers] provide
+random-access parsing and have [been proposed as alternatives] given their
+theoretical performance advantages. However, changing the Parquet format is a
+significant undertaking, requires buy-in from the community and ecosystem,
+and would likely take years to be adopted.
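The lack of random access can be illustrated with a toy buffer of length-prefixed strings. In this sketch, the function name and the single-byte length prefix are simplifying assumptions (Thrift itself uses a varint length). Reaching the string at position `index` requires reading the length prefix of every string before it.

```rust
/// Return the string at position `index` in a buffer of length-prefixed
/// strings, together with the number of length prefixes that had to be read.
/// Assumption for brevity: each length fits in a single byte.
fn nth_string(buf: &[u8], index: usize) -> Option<(&str, usize)> {
    let mut pos = 0;
    let mut prefixes_read = 0;
    for current in 0..=index {
        // The only way to find where the next element starts is to read
        // the length of the current one.
        let len = *buf.get(pos)? as usize;
        prefixes_read += 1;
        let start = pos + 1;
        let end = start + len;
        if current == index {
            let bytes = buf.get(start..end)?;
            return std::str::from_utf8(bytes).ok().map(|s| (s, prefixes_read));
        }
        pos = end;
    }
    None
}
```

The work grows with the position of the requested element, which is the cost that a random-access format avoids.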
+
+[How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?]:
https://www.influxdata.com/blog/how-good-parquet-wide-tables/
+[Apache Thrift]: https://thrift.apache.org/
+[Flatbuffers]: https://google.github.io/flatbuffers/
+
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/rust-parquet-metadata/thrift-compact-encoding.png" width="100%"
class="img-responsive" alt="Thrift Compact Encoding Illustration"
aria-hidden="true">
+</div>
+
+*Figure 5:* Parquet metadata is serialized using the [Thrift compact binary
+encoding format]. Each field is stored using a variable number of bytes that
+depends on its value. Primitive types use a variable-length encoding and
strings
+and lists are prefixed with their lengths.
+
+[Thrift compact binary encoding format]:
https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md
+[Protocol Buffers]: https://developers.google.com/protocol-buffers
+[data definition language]: https://thrift.apache.org/docs/idl
+[parquet.thrift]:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
+[gRPC]: https://grpc.io/
+[Xiangpeng Hao]: https://xiangpeng.systems/
+
+Despite Thrift's very real disadvantage due to lack of random access, software
+optimizations are much easier to deploy than format changes, and thus we
decided
+to explore this approach first. [Xiangpeng Hao] previously measured significant
+(2x–4x) potential performance improvements simply by optimizing the
+implementation of Parquet footer parsing. See the blog post [How Good is
+Parquet for Wide Tables (Machine Learning Workloads) Really?] for more details.
+
+## Parsing Thrift Using Generated Parsers
+
+*Parsing* Parquet metadata is the process of decoding the Thrift-encoded bytes
+into in-memory structures that can be used for computation. Most Parquet
+implementations use one of the existing [Thrift compilers] to generate a parser
+that converts Thrift binary data into generated code structures, and then copy
+relevant portions of those generated structures into their API-level
structures.
+For example, the [C/C++ Parquet implementation] includes a two-step process,
+as does [parquet-java]. [DuckDB] also contains a Thrift compiler–generated
+parser.
+
+In versions `56.2.0` and earlier, the Apache Arrow Rust implementation used the
+same pattern. The [format] module contains a parser generated by the [thrift
+crate] and the [parquet.thrift] definition. To parse metadata, it:
+
+1. Invokes the generated parser on the Thrift binary data, producing
+ generated in-memory structures (e.g., [`struct FileMetaData`]), then
+2. Copies the relevant fields into a more user-friendly representation,
+ [`ParquetMetaData`].
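The second step of the list above can be sketched as a plain conversion between two structures. Both types here are hypothetical stand-ins, far smaller than the real generated `FileMetaData` or the crate's API-level metadata, and are meant only to show the shape of the copy.

```rust
/// Hypothetical stand-in for a struct emitted by a Thrift code generator.
/// It mirrors the wire format, with optional fields represented as `Option`.
#[derive(Debug, Clone)]
struct GeneratedFileMetaData {
    version: i32,
    num_rows: i64,
    created_by: Option<String>,
    column_names: Vec<String>,
}

/// Hypothetical stand-in for the friendlier API-level structure.
#[derive(Debug, PartialEq)]
struct ApiMetadata {
    num_rows: usize,
    created_by: String,
    column_names: Vec<String>,
}

/// Step 2: copy the relevant fields out of the generated structure.
/// Fields the API does not expose (here `version`) were still decoded
/// and allocated in step 1, then dropped.
impl From<GeneratedFileMetaData> for ApiMetadata {
    fn from(generated: GeneratedFileMetaData) -> Self {
        ApiMetadata {
            num_rows: usize::try_from(generated.num_rows).unwrap_or(0),
            created_by: generated.created_by.unwrap_or_default(),
            column_names: generated.column_names,
        }
    }
}
```

In this pattern the generated structure is fully materialized before any field is copied, so every field pays the decoding cost even when the API-level structure never uses it.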
+
+[thrift crate]: https://crates.io/crates/thrift
+[format]: https://docs.rs/parquet/56.2.0/parquet/format/index.html
+[`ParquetMetadata`]:
https://docs.rs/parquet/56.2.0/parquet/file/metadata/struct.ParquetMetaData.html
+[`struct FileMetaData`]:
https://docs.rs/parquet/56.2.0/parquet/format/struct.FileMetaData.html
+
+[two]:
https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/build-support/update-thrift.sh#L23
+[step process]:
https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/src/parquet/thrift_internal.h#L56
+[C/C++ Parquet implementation]:
https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/src/parquet
+[parquet-java]:
https://github.com/apache/parquet-java/blob/0fea3e1e22fffb0a25193e3efb9a5d090899458a/parquet-format-structures/pom.xml#L69-L88
+[DuckDB]:
https://github.com/duckdb/duckdb/blob/8f512187537c65d36ce6d6f562b75a37e8d4ee54/third_party/parquet/parquet_types.h#L1-L6
+
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="{{ site.baseurl
}}/img/rust-parquet-metadata/original-pipeline.png" width="100%"
class="img-responsive" alt="Original Parquet Parsing Pipeline"
aria-hidden="true">
+</div>
+
+*Figure 6:* Two-step process to read Parquet metadata: A parser created with
the
Review Comment:
Thanks -- fixed in 3066161e695