Re: [PR] [Website]: Blog post about new Rust Parquet Metadata parser [arrow-site]

via GitHub Thu, 09 Oct 2025 12:44:34 -0700


etseidl commented on code in PR #711:
URL: https://github.com/apache/arrow-site/pull/711#discussion_r2417761635



##########
_posts/2025-10-30-rust-parquet-metadata.md:
##########
@@ -24,10 +24,10 @@ limitations under the License.
 {% endcomment %}
 -->
 
-
-*Editors Note: While the [Apache Arrow] and [Apache Parquet] are separate projects,
-the [arrow-rs] repository hosts the development of the  [parquet] Rust crate, a widely used
-and high performance implementation.*
+*Editors Note: While the [Apache Arrow] and [Apache Parquet] are separate

Review Comment:
   ```suggestion
   *Editors Note: While [Apache Arrow] and [Apache Parquet] are separate
   ```



##########
_posts/2025-10-30-rust-parquet-metadata.md:
##########
@@ -178,67 +202,155 @@ but this crate is not optimized for speed, and so we 
have been looking for alter
 Figure 2: Two step process to read parquet metadata: use code generated by the 
thrift crate to parse
 the thrift binary into in-memory structures, then convert the in-memory 
structures into arrow's
 
-## New Design: Custom Parser
-
-As is typical of code generated from another tool, oportunities for 
optimization
-are typically limited, both because the generated is not easy to modify.
-(TODO find exmaples of trying to help the thrift crate be faster)
+The parsers generated by standard Thrift compilers typically parse *all* fields
+in a single pass over the thrift encoded bytes, copying data into in-memory, 
heap
+allocated structures such as a Rust [`Vec`], or C++ [`std::vector`], as
+shown in the figure below. This approach is simple and straightforward and a 
good
+choice given Thrift's design point of encoding network messages, which 
typically
+don't send extraneous information. However, reading all fields and allocating
+memory for them is not necessary in many cases when reading Parquet metadata.
 
-For example, the last release of the rust thrift compiler crate 
https://crates.io/crates/thrift/0.17.0
-was three years ago, and the last commit to the repository was over a year ago
-https://crates.io/crates/thrift
-We hope we can also help make the rust thrift crate to be better. 
 
+[`Vec`]: https://doc.rust-lang.org/std/vec/struct.Vec.html
+[`std::vector`]: https://en.cppreference.com/w/cpp/container/vector.html
+[thrift compilers]: https://thrift.apache.org/lib/
 
-So we instead wrote a custom parser that parses the thrift binary directly 
into the needed
-structures.
+For some use cases, such as caching the parsed Parquet metadata, all the fields
+are needed and thus must be parsed. However, for many use cases, parsing the
+entire metadata into in memory structures is wasteful. For example, a query 
that
+reads only 10 columns from a file with 1000 columns with a single column 
predicate
+(e.g. `time > now() - '1 minute'`) only needs [`Statistics`] (or
+[`ColumnIndex`]) for the predicate column and the [`ColumnChunk`] for the 10 
columns. Parsing (allocating and copying)
+statistics for remaining 999 columns which are not used in predicates is
+unnecessary work. While all the metadata bytes must still be fetched and 
scanned
+to find the relevant positions, CPUs are quite fast at scanning data, so 
skipping parsing
+unnecessary data can speed up overall metadata parsing significantly.
 
 
-The speedup was achieved by using a custom parser that is optimized for the 
specific
-subset of the Thrift format used by Parquet, and by using various performance 
optimizations.
+[`Statistics`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912
+[`ColumnIndex`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163
+[`ColumnChunk`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L958
 
 
 <!-- Image source: 
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
 -->
 <div style="display: flex; gap: 16px; justify-content: center; align-items: 
flex-start;">
-  <img src="{{ site.baseurl }}/img/rust-parquet-metadata/new-pipeline.png" 
width="100%" class="img-responsive" alt="New Parquet Parsing Pipeline" 
aria-hidden="true">
+  <img src="{{ site.baseurl 
}}/img/rust-parquet-metadata/thrift-parsing-allocations.png" width="100%" 
class="img-responsive" alt="Original Parquet Parsing Pipeline" 
aria-hidden="true">
 </div>
 
-Figure 3: New one step process to read parquet metadata: use a custom parser 
to parse the thrift binary directly
-own in-memory representation of parquet metadata.
+*Figure XX*: Generated thrift parsers typically parse into heap allocated structures requiring
+in many small heap allocations, which are expensive.

Review Comment:
   ```suggestion
   many small heap allocations, which are expensive.
   ```



##########
_posts/2025-10-30-rust-parquet-metadata.md:
##########
@@ -178,67 +202,155 @@ but this crate is not optimized for speed, and so we 
have been looking for alter
 Figure 2: Two step process to read parquet metadata: use code generated by the 
thrift crate to parse
 the thrift binary into in-memory structures, then convert the in-memory 
structures into arrow's
 
-## New Design: Custom Parser
-
-As is typical of code generated from another tool, oportunities for 
optimization
-are typically limited, both because the generated is not easy to modify.
-(TODO find exmaples of trying to help the thrift crate be faster)
+The parsers generated by standard Thrift compilers typically parse *all* fields
+in a single pass over the thrift encoded bytes, copying data into in-memory, 
heap
+allocated structures such as a Rust [`Vec`], or C++ [`std::vector`], as
+shown in the figure below. This approach is simple and straightforward and a 
good
+choice given Thrift's design point of encoding network messages, which 
typically
+don't send extraneous information. However, reading all fields and allocating
+memory for them is not necessary in many cases when reading Parquet metadata.
 
-For example, the last release of the rust thrift compiler crate 
https://crates.io/crates/thrift/0.17.0
-was three years ago, and the last commit to the repository was over a year ago
-https://crates.io/crates/thrift
-We hope we can also help make the rust thrift crate to be better. 
 
+[`Vec`]: https://doc.rust-lang.org/std/vec/struct.Vec.html
+[`std::vector`]: https://en.cppreference.com/w/cpp/container/vector.html
+[thrift compilers]: https://thrift.apache.org/lib/
 
-So we instead wrote a custom parser that parses the thrift binary directly 
into the needed
-structures.
+For some use cases, such as caching the parsed Parquet metadata, all the fields
+are needed and thus must be parsed. However, for many use cases, parsing the
+entire metadata into in memory structures is wasteful. For example, a query 
that
+reads only 10 columns from a file with 1000 columns with a single column 
predicate
+(e.g. `time > now() - '1 minute'`) only needs [`Statistics`] (or
+[`ColumnIndex`]) for the predicate column and the [`ColumnChunk`] for the 10 
columns. Parsing (allocating and copying)
+statistics for remaining 999 columns which are not used in predicates is
+unnecessary work. While all the metadata bytes must still be fetched and 
scanned
+to find the relevant positions, CPUs are quite fast at scanning data, so 
skipping parsing
+unnecessary data can speed up overall metadata parsing significantly.
 
 
-The speedup was achieved by using a custom parser that is optimized for the 
specific
-subset of the Thrift format used by Parquet, and by using various performance 
optimizations.
+[`Statistics`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912
+[`ColumnIndex`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163
+[`ColumnChunk`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L958
 
 
 <!-- Image source: 
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
 -->
 <div style="display: flex; gap: 16px; justify-content: center; align-items: 
flex-start;">
-  <img src="{{ site.baseurl }}/img/rust-parquet-metadata/new-pipeline.png" 
width="100%" class="img-responsive" alt="New Parquet Parsing Pipeline" 
aria-hidden="true">
+  <img src="{{ site.baseurl 
}}/img/rust-parquet-metadata/thrift-parsing-allocations.png" width="100%" 
class="img-responsive" alt="Original Parquet Parsing Pipeline" 
aria-hidden="true">
 </div>
 
-Figure 3: New one step process to read parquet metadata: use a custom parser 
to parse the thrift binary directly
-own in-memory representation of parquet metadata.
+*Figure XX*: Generated thrift parsers typically parse into heap allocated 
structures requiring
+in many small heap allocations, which are expensive.
 
-The custom parser is optimized for the specific subset of the Thrift format 
used
-by Parquet, and by using various performance optimizations
 
-### Example optimizations
 
-The obvious optimization is to remove the intermediate step of creating
-the in-memory structures generated by the thrift crate, and instead parse
-the thrift binary directly into arrow's own in-memory representation of 
parquet metadata.
+## New Design: Custom Thrift Parser
 
-However, there are other optimizations that can now be applied such as: (TODO 
GET LIST)
+As is typical of code generated from another tool, opportunities for
+optimization in the code generated by the thrift compilers are typically
+limited. This is because 
 
+1. The generated is not easy to modify (it must be re-generated from from the thrift definitions when the definitions change)

Review Comment:
   ```suggestion
   1. The generated code is not easy to modify (it must be re-generated from the thrift definitions when the definitions change)
   ```



##########
_posts/2025-10-30-rust-parquet-metadata.md:
##########
@@ -178,67 +202,155 @@ but this crate is not optimized for speed, and so we 
have been looking for alter
 Figure 2: Two step process to read parquet metadata: use code generated by the 
thrift crate to parse
 the thrift binary into in-memory structures, then convert the in-memory 
structures into arrow's
 
-## New Design: Custom Parser
-
-As is typical of code generated from another tool, oportunities for 
optimization
-are typically limited, both because the generated is not easy to modify.
-(TODO find exmaples of trying to help the thrift crate be faster)
+The parsers generated by standard Thrift compilers typically parse *all* fields
+in a single pass over the thrift encoded bytes, copying data into in-memory, 
heap
+allocated structures such as a Rust [`Vec`], or C++ [`std::vector`], as
+shown in the figure below. This approach is simple and straightforward and a 
good
+choice given Thrift's design point of encoding network messages, which 
typically
+don't send extraneous information. However, reading all fields and allocating
+memory for them is not necessary in many cases when reading Parquet metadata.
 
-For example, the last release of the rust thrift compiler crate 
https://crates.io/crates/thrift/0.17.0
-was three years ago, and the last commit to the repository was over a year ago
-https://crates.io/crates/thrift
-We hope we can also help make the rust thrift crate to be better. 
 
+[`Vec`]: https://doc.rust-lang.org/std/vec/struct.Vec.html
+[`std::vector`]: https://en.cppreference.com/w/cpp/container/vector.html
+[thrift compilers]: https://thrift.apache.org/lib/
 
-So we instead wrote a custom parser that parses the thrift binary directly 
into the needed
-structures.
+For some use cases, such as caching the parsed Parquet metadata, all the fields
+are needed and thus must be parsed. However, for many use cases, parsing the
+entire metadata into in memory structures is wasteful. For example, a query 
that
+reads only 10 columns from a file with 1000 columns with a single column 
predicate
+(e.g. `time > now() - '1 minute'`) only needs [`Statistics`] (or
+[`ColumnIndex`]) for the predicate column and the [`ColumnChunk`] for the 10 
columns. Parsing (allocating and copying)
+statistics for remaining 999 columns which are not used in predicates is
+unnecessary work. While all the metadata bytes must still be fetched and 
scanned
+to find the relevant positions, CPUs are quite fast at scanning data, so 
skipping parsing
+unnecessary data can speed up overall metadata parsing significantly.
 
 
-The speedup was achieved by using a custom parser that is optimized for the 
specific
-subset of the Thrift format used by Parquet, and by using various performance 
optimizations.
+[`Statistics`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912
+[`ColumnIndex`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163
+[`ColumnChunk`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L958
 
 
 <!-- Image source: 
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
 -->
 <div style="display: flex; gap: 16px; justify-content: center; align-items: 
flex-start;">
-  <img src="{{ site.baseurl }}/img/rust-parquet-metadata/new-pipeline.png" 
width="100%" class="img-responsive" alt="New Parquet Parsing Pipeline" 
aria-hidden="true">
+  <img src="{{ site.baseurl 
}}/img/rust-parquet-metadata/thrift-parsing-allocations.png" width="100%" 
class="img-responsive" alt="Original Parquet Parsing Pipeline" 
aria-hidden="true">
 </div>
 
-Figure 3: New one step process to read parquet metadata: use a custom parser 
to parse the thrift binary directly
-own in-memory representation of parquet metadata.
+*Figure XX*: Generated thrift parsers typically parse into heap allocated 
structures requiring
+in many small heap allocations, which are expensive.
 
-The custom parser is optimized for the specific subset of the Thrift format 
used
-by Parquet, and by using various performance optimizations
 
-### Example optimizations
 
-The obvious optimization is to remove the intermediate step of creating
-the in-memory structures generated by the thrift crate, and instead parse
-the thrift binary directly into arrow's own in-memory representation of 
parquet metadata.
+## New Design: Custom Thrift Parser
 
-However, there are other optimizations that can now be applied such as: (TODO 
GET LIST)
+As is typical of code generated from another tool, opportunities for
+optimization in the code generated by the thrift compilers are typically
+limited. This is because 
 
+1. The generated is not easy to modify (it must be re-generated from from the 
thrift definitions when the definitions change)
+2. The generated code must be general purpose, and easy to embed, and 
typically has
+   generates structures with a one to one mapping of the thrift definitions, 
+   which limits adding additional optimizations such as zero copy parsing, or 
+   amortized memory allocation strategies.
+3. The thrift compilers are quite stable, which is important to enable 
embedding 
+   generated code. For example, the [last release of the Rust  `thrift` crate] 
+   was almost three years ago.
 
-### Maintainability
-THe largest concern with this approach is that it is more difficult to 
maintain, as any changes to the parquet
-format must be manually reflected in the custom parser, and the thrift 
definitions
-do get updated (e.g. for the recent additions of [Geometry] and [Variant] 
tyoes)
+[last release of the Rust `thrift` crate]: 
https://crates.io/crates/thrift/0.17.0
 
-However, one of the arrow-rs contributors, Jorn Horstmann (todo link) 
prototyped a macro
-based approach, that generates the custom parser code using rust macros so 
that the
-rust structures look very similar to the original thrift definitions. 
+Thus, we concluded that we needed a custom thrift parser that could be 
optimized
+for the specific subset of the Thrift format used by Parquet. 
+So we instead wrote a custom parser that parses the thrift binary directly 
into the needed
+structures as shown in the figure below. 
 
-For exmaple, here is the thrift definition of a parquet file metadata:
 
-(TOD)O
+<!-- Image source: 
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
 -->
+<div style="display: flex; gap: 16px; justify-content: center; align-items: 
flex-start;">
+  <img src="{{ site.baseurl }}/img/rust-parquet-metadata/new-pipeline.png" 
width="100%" class="img-responsive" alt="New Parquet Parsing Pipeline" 
aria-hidden="true">
+</div>
 
+Figure 3: New one step process to read parquet metadata: use a custom parser 
to parse the thrift binary directly
+own in-memory representation of parquet metadata.
+
+The custom parser is optimized for the specific subset of the Thrift format 
used
+by Parquet, and by using various performance optimizations, such as careful 
memory 
+allocation and avoiding unnecessary copies. 
 
-And the corresponding Rust structure that is created directly from the custom 
parser:
+The largest initial speedup came from removing the intermediate in-memory
+structures, and instead simply created the needed in-memory representation of
+parquet metadata. We also carefully hand optimized certain thrift decoding and
+memory allocation paths.
 
-(TODO example
-)
+### Maintainability
+
+THe largest concern with this approach is that it is more difficult to 
maintain, as any changes to the [`parquet.thrift`] file 
+must be manually reflected in the custom parser, and the thrift definitions
+do get updated regularly (e.g. for the recent additions of [Geospatial] and 
[Variant] types)
+
+[Geospatial]: 
https://github.com/apache/parquet-format/blob/master/Geospatial.md
+[Variant]: 
https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
+
+However, [Jörn Horstmann] prototyped a [Rust macro
+based approach] that generates parsing code for annotated Rust structs
+that closely resemble the thrift definitions. Structs that need additional 
optimization
+can be manually implemented. This approach is similar to how the [serde] crate
+generates serialization and deserialization code for Rust structs.
+
+[Jörn Horstmann]: https://github.com/jhorstmann
+[Rust macro based approach]: https://github.com/jhorstmann/compact-thrift
+[serde]: https://serde.rs/
+
+<!-- https://x.com/jhorstmann23/status/1803426667748053448--> 
+
+For example, here is the thrift definition of the [`FileMetaData`] structure 
(comments omitted for brevity):
+
+[`FileMetaData`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1254C1-L1314C2
+
+```thrift
+struct FileMetaData {
+  1: required i32 version
+  2: required list<SchemaElement> schema;
+  3: required i64 num_rows
+  4: required list<RowGroup> row_groups
+  5: optional list<KeyValue> key_value_metadata
+  6: optional string created_by
+  7: optional list<ColumnOrder> column_orders;
+  8: optional EncryptionAlgorithm encryption_algorithm
+  9: optional binary footer_signing_key_metadata
+}
+```
+
+And here is the corresponding Rust structure annotated with the custom thrift 
parsing macros:
+
+https://github.com/apache/arrow-rs/blob/02fa779a9cb122c5218293be3afb980832701683/parquet/src/file/metadata/thrift_gen.rs#L146-L158
+
+```rust
+thrift_struct!(
+struct FileMetaData<'a> {
+1: required i32 version
+2: required list<'a><SchemaElement> schema;
+3: required i64 num_rows
+4: required list<'a><RowGroup> row_groups
+5: optional list<KeyValue> key_value_metadata
+6: optional string<'a> created_by
+7: optional list<ColumnOrder> column_orders;
+8: optional EncryptionAlgorithm encryption_algorithm
+9: optional binary<'a> footer_signing_key_metadata
+}
+);
+```
+
+This organization makes it easy to see the correspondence between the thrift definition
+and the Rust structure, and makes it easy to update the Rust structure when the
+thrift definition changes.

Review Comment:
   This holds for structures we don't wish to customize. For the heavy-hitting
   structures (`FileMetaData`, `RowGroupMetaData`, `ColumnChunkMetaData`) we
   actually skip the macros and hand-write the parsers, while retaining the
   ability to add new structures such as `GeospatialStatistics` with relative
   ease.
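
The split described in this comment (macro-generated parsers for simple structures, hand-written parsers for the hot ones) might look roughly like the Rust sketch below. Every name in it is hypothetical: `Cursor`, `ReadThrift`, `thrift_struct_sketch!`, `KeyValueSketch`, `RowGroupSketch`, and `decode_demo` are illustrations and not the arrow-rs API. The byte layout is also a simplified assumption (positional fields, varint lengths), not the real Thrift compact protocol, which additionally encodes field headers.

```rust
/// Cursor over the remaining input bytes (hypothetical helper).
struct Cursor<'a> {
    buf: &'a [u8],
}

impl<'a> Cursor<'a> {
    /// Decode one unsigned LEB128 varint; `None` on truncated or overlong input.
    fn read_varint(&mut self) -> Option<u64> {
        let mut value = 0u64;
        let mut shift = 0u32;
        loop {
            if shift >= 64 {
                return None; // more than 10 bytes cannot be a u64 varint
            }
            let buf: &'a [u8] = self.buf;
            let (&byte, rest) = buf.split_first()?;
            self.buf = rest;
            value |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                return Some(value);
            }
            shift += 7;
        }
    }
}

/// Anything that can be decoded from the cursor (hypothetical trait).
trait ReadThrift<'a>: Sized {
    fn read(cur: &mut Cursor<'a>) -> Option<Self>;
}

impl<'a> ReadThrift<'a> for u64 {
    fn read(cur: &mut Cursor<'a>) -> Option<Self> {
        cur.read_varint()
    }
}

/// Length-prefixed bytes, borrowed from the input rather than copied.
impl<'a> ReadThrift<'a> for &'a [u8] {
    fn read(cur: &mut Cursor<'a>) -> Option<Self> {
        let len = usize::try_from(cur.read_varint()?).ok()?;
        let buf: &'a [u8] = cur.buf;
        if len > buf.len() {
            return None;
        }
        let (head, rest) = buf.split_at(len);
        cur.buf = rest;
        Some(head)
    }
}

/// Generates a struct plus a field-by-field `ReadThrift` impl.
macro_rules! thrift_struct_sketch {
    (struct $name:ident < $lt:lifetime > { $($field:ident : $ty:ty),* $(,)? }) => {
        #[derive(Debug, PartialEq)]
        struct $name<$lt> { $($field: $ty),* }

        impl<$lt> ReadThrift<$lt> for $name<$lt> {
            fn read(cur: &mut Cursor<$lt>) -> Option<Self> {
                Some(Self { $($field: <$ty as ReadThrift<$lt>>::read(cur)?),* })
            }
        }
    };
}

// Simple structure: the macro-generated parser is good enough.
thrift_struct_sketch!(
    struct KeyValueSketch<'a> {
        key: &'a [u8],
        value: &'a [u8],
    }
);

/// Hot structure: hand-written so it can step over a payload it does not need.
#[derive(Debug, PartialEq)]
struct RowGroupSketch {
    num_rows: u64,
    skipped_len: usize,
}

impl<'a> ReadThrift<'a> for RowGroupSketch {
    fn read(cur: &mut Cursor<'a>) -> Option<Self> {
        let num_rows = u64::read(cur)?;
        // Advance past the length-prefixed payload without materializing it.
        let payload = <&'a [u8] as ReadThrift<'a>>::read(cur)?;
        Some(Self { num_rows, skipped_len: payload.len() })
    }
}

/// "k", "vv", then num_rows = 300 (varint 0xAC 0x02) and a 3-byte payload.
const DEMO_BYTES: &[u8] = &[1, b'k', 2, b'v', b'v', 0xAC, 0x02, 3, 9, 9, 9];

fn decode_demo() -> Option<(KeyValueSketch<'static>, RowGroupSketch)> {
    let mut cur = Cursor { buf: DEMO_BYTES };
    let kv = KeyValueSketch::read(&mut cur)?;
    let rg = RowGroupSketch::read(&mut cur)?;
    Some((kv, rg))
}
```

Calling `decode_demo()` from a `main` function or a test decodes one macro-generated and one hand-written structure from the same buffer. The point of the design is that both kinds of parser share one trait, so a new simple structure only needs a macro invocation while a hot structure can still be specialized by hand.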



##########
_posts/2025-10-30-rust-parquet-metadata.md:
##########
@@ -137,38 +137,62 @@ decoder implementation.
 
 
 ## Background: Apache Thrift
+
+[Apache Thrift] is a framework for defining network data types and service
+interfaces and includes a data definition language similar [Protocol Buffers].

Review Comment:
   ```suggestion
   interfaces and includes a data definition language similar to [Protocol Buffers].
   ```



##########
_posts/2025-10-30-rust-parquet-metadata.md:
##########
@@ -178,67 +202,155 @@ but this crate is not optimized for speed, and so we 
have been looking for alter
 Figure 2: Two step process to read parquet metadata: use code generated by the 
thrift crate to parse
 the thrift binary into in-memory structures, then convert the in-memory 
structures into arrow's
 
-## New Design: Custom Parser
-
-As is typical of code generated from another tool, oportunities for 
optimization
-are typically limited, both because the generated is not easy to modify.
-(TODO find exmaples of trying to help the thrift crate be faster)
+The parsers generated by standard Thrift compilers typically parse *all* fields
+in a single pass over the thrift encoded bytes, copying data into in-memory, 
heap
+allocated structures such as a Rust [`Vec`], or C++ [`std::vector`], as
+shown in the figure below. This approach is simple and straightforward and a 
good
+choice given Thrift's design point of encoding network messages, which 
typically
+don't send extraneous information. However, reading all fields and allocating
+memory for them is not necessary in many cases when reading Parquet metadata.
 
-For example, the last release of the rust thrift compiler crate 
https://crates.io/crates/thrift/0.17.0
-was three years ago, and the last commit to the repository was over a year ago
-https://crates.io/crates/thrift
-We hope we can also help make the rust thrift crate to be better. 
 
+[`Vec`]: https://doc.rust-lang.org/std/vec/struct.Vec.html
+[`std::vector`]: https://en.cppreference.com/w/cpp/container/vector.html
+[thrift compilers]: https://thrift.apache.org/lib/
 
-So we instead wrote a custom parser that parses the thrift binary directly 
into the needed
-structures.
+For some use cases, such as caching the parsed Parquet metadata, all the fields
+are needed and thus must be parsed. However, for many use cases, parsing the
+entire metadata into in memory structures is wasteful. For example, a query 
that
+reads only 10 columns from a file with 1000 columns with a single column 
predicate
+(e.g. `time > now() - '1 minute'`) only needs [`Statistics`] (or
+[`ColumnIndex`]) for the predicate column and the [`ColumnChunk`] for the 10 
columns. Parsing (allocating and copying)
+statistics for remaining 999 columns which are not used in predicates is
+unnecessary work. While all the metadata bytes must still be fetched and 
scanned
+to find the relevant positions, CPUs are quite fast at scanning data, so 
skipping parsing
+unnecessary data can speed up overall metadata parsing significantly.
 
 
-The speedup was achieved by using a custom parser that is optimized for the 
specific
-subset of the Thrift format used by Parquet, and by using various performance 
optimizations.
+[`Statistics`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912
+[`ColumnIndex`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163
+[`ColumnChunk`]: 
https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L958
 
 
 <!-- Image source: 
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
 -->
 <div style="display: flex; gap: 16px; justify-content: center; align-items: 
flex-start;">
-  <img src="{{ site.baseurl }}/img/rust-parquet-metadata/new-pipeline.png" 
width="100%" class="img-responsive" alt="New Parquet Parsing Pipeline" 
aria-hidden="true">
+  <img src="{{ site.baseurl 
}}/img/rust-parquet-metadata/thrift-parsing-allocations.png" width="100%" 
class="img-responsive" alt="Original Parquet Parsing Pipeline" 
aria-hidden="true">
 </div>
 
-Figure 3: New one step process to read parquet metadata: use a custom parser 
to parse the thrift binary directly
-own in-memory representation of parquet metadata.
+*Figure XX*: Generated thrift parsers typically parse into heap allocated 
structures requiring
+in many small heap allocations, which are expensive.
 
-The custom parser is optimized for the specific subset of the Thrift format 
used
-by Parquet, and by using various performance optimizations
 
-### Example optimizations
 
-The obvious optimization is to remove the intermediate step of creating
-the in-memory structures generated by the thrift crate, and instead parse
-the thrift binary directly into arrow's own in-memory representation of 
parquet metadata.
+## New Design: Custom Thrift Parser
 
-However, there are other optimizations that can now be applied such as: (TODO 
GET LIST)
+As is typical of code generated from another tool, opportunities for
+optimization in the code generated by the thrift compilers are typically
+limited. This is because 
 
+1. The generated is not easy to modify (it must be re-generated from from the thrift definitions when the definitions change)
+2. The generated code must be general purpose, and easy to embed, and typically has
+   generates structures with a one to one mapping of the thrift definitions, 

Review Comment:
   ```suggestion
      generated structures with a one to one mapping of the thrift definitions, 
   ```


