This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 539b3c108ff Updating built site
539b3c108ff is described below
commit 539b3c108ffe5ef50cbdfbf8631858558cabff79
Author: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
AuthorDate: Thu Oct 23 16:29:06 2025 +0000
Updating built site
---
blog/2025/10/23/rust-parquet-metadata/index.html | 555 +++++++++++++++++++++
blog/index.html | 25 +
feed.xml | 326 +++++++++---
img/rust-parquet-metadata/flow.png | Bin 0 -> 277891 bytes
img/rust-parquet-metadata/new-pipeline.png | Bin 0 -> 406276 bytes
img/rust-parquet-metadata/original-pipeline.png | Bin 0 -> 403736 bytes
img/rust-parquet-metadata/parquet.png | Bin 0 -> 78726 bytes
img/rust-parquet-metadata/results.png | Bin 0 -> 78434 bytes
img/rust-parquet-metadata/scaling.png | Bin 0 -> 48806 bytes
.../thrift-compact-encoding.png | Bin 0 -> 392751 bytes
.../thrift-parsing-allocations.png | Bin 0 -> 585858 bytes
11 files changed, 845 insertions(+), 61 deletions(-)
diff --git a/blog/2025/10/23/rust-parquet-metadata/index.html
b/blog/2025/10/23/rust-parquet-metadata/index.html
new file mode 100644
index 00000000000..17edbd1189d
--- /dev/null
+++ b/blog/2025/10/23/rust-parquet-metadata/index.html
@@ -0,0 +1,555 @@
+<!DOCTYPE html>
+<html lang="en-US">
+ <head>
+ <meta charset="UTF-8">
+ <meta http-equiv="X-UA-Compatible" content="IE=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <!-- The above meta tags *must* come first in the head; any other head
content must come *after* these tags -->
+
+ <title>3x-9x Faster Apache Parquet Footer Metadata Using a Custom Thrift
Parser in Rust | Apache Arrow</title>
+
+
+ <!-- Begin Jekyll SEO tag v2.8.0 -->
+<meta name="generator" content="Jekyll v4.4.1" />
+<meta property="og:title" content="3x-9x Faster Apache Parquet Footer Metadata
Using a Custom Thrift Parser in Rust" />
+<meta name="author" content="alamb" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="Editor’s Note: While Apache Arrow and Apache
Parquet are separate projects, the Arrow arrow-rs repository hosts the
development of the parquet Rust crate, a widely used and high-performance
Parquet implementation. Summary Version 57.0.0 of the parquet Rust crate
decodes metadata more than three times faster than previous versions thanks to
a new custom Apache Thrift parser. The new parser is both faster in all cases
and enables further performance improv [...]
+<meta property="og:description" content="Editor’s Note: While Apache Arrow and
Apache Parquet are separate projects, the Arrow arrow-rs repository hosts the
development of the parquet Rust crate, a widely used and high-performance
Parquet implementation. Summary Version 57.0.0 of the parquet Rust crate
decodes metadata more than three times faster than previous versions thanks to
a new custom Apache Thrift parser. The new parser is both faster in all cases
and enables further performance [...]
+<link rel="canonical"
href="https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/" />
+<meta property="og:url"
content="https://arrow.apache.org/blog/2025/10/23/rust-parquet-metadata/" />
+<meta property="og:site_name" content="Apache Arrow" />
+<meta property="og:image"
content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"
/>
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2025-10-23T00:00:00-04:00" />
+<meta name="twitter:card" content="summary_large_image" />
+<meta property="twitter:image"
content="https://arrow.apache.org/img/arrow-logo_horizontal_black-txt_white-bg.png"
/>
+<meta property="twitter:title" content="3x-9x Faster Apache Parquet Footer
Metadata Using a Custom Thrift Parser in Rust" />
+<script type="application/ld+json">
+{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"alamb"},"dateModified":"2025-10-23T00:00:00-04:00","datePublished":"2025-10-23T00:00:00-04:00","description":"Editor’s
Note: While Apache Arrow and Apache Parquet are separate projects, the Arrow
arrow-rs repository hosts the development of the parquet Rust crate, a widely
used and high-performance Parquet implementation. Summary Version 57.0.0 of the
parquet Rust crate decodes metadata more than th [...]
+<!-- End Jekyll SEO tag -->
+
+
+ <!-- favicons -->
+ <link rel="icon" type="image/png" sizes="16x16"
href="/img/favicon-16x16.png" id="light1">
+ <link rel="icon" type="image/png" sizes="32x32"
href="/img/favicon-32x32.png" id="light2">
+ <link rel="apple-touch-icon" type="image/png" sizes="180x180"
href="/img/apple-touch-icon.png" id="light3">
+ <link rel="apple-touch-icon" type="image/png" sizes="120x120"
href="/img/apple-touch-icon-120x120.png" id="light4">
+ <link rel="apple-touch-icon" type="image/png" sizes="76x76"
href="/img/apple-touch-icon-76x76.png" id="light5">
+ <link rel="apple-touch-icon" type="image/png" sizes="60x60"
href="/img/apple-touch-icon-60x60.png" id="light6">
+ <!-- dark mode favicons -->
+ <link rel="icon" type="image/png" sizes="16x16"
href="/img/favicon-16x16-dark.png" id="dark1">
+ <link rel="icon" type="image/png" sizes="32x32"
href="/img/favicon-32x32-dark.png" id="dark2">
+ <link rel="apple-touch-icon" type="image/png" sizes="180x180"
href="/img/apple-touch-icon-dark.png" id="dark3">
+ <link rel="apple-touch-icon" type="image/png" sizes="120x120"
href="/img/apple-touch-icon-120x120-dark.png" id="dark4">
+ <link rel="apple-touch-icon" type="image/png" sizes="76x76"
href="/img/apple-touch-icon-76x76-dark.png" id="dark5">
+ <link rel="apple-touch-icon" type="image/png" sizes="60x60"
href="/img/apple-touch-icon-60x60-dark.png" id="dark6">
+
+ <script>
+ // Switch to the dark-mode favicons if prefers-color-scheme: dark
+ function onUpdate() {
+ light1 = document.querySelector('link#light1');
+ light2 = document.querySelector('link#light2');
+ light3 = document.querySelector('link#light3');
+ light4 = document.querySelector('link#light4');
+ light5 = document.querySelector('link#light5');
+ light6 = document.querySelector('link#light6');
+
+ dark1 = document.querySelector('link#dark1');
+ dark2 = document.querySelector('link#dark2');
+ dark3 = document.querySelector('link#dark3');
+ dark4 = document.querySelector('link#dark4');
+ dark5 = document.querySelector('link#dark5');
+ dark6 = document.querySelector('link#dark6');
+
+ if (matcher.matches) {
+ light1.remove();
+ light2.remove();
+ light3.remove();
+ light4.remove();
+ light5.remove();
+ light6.remove();
+ document.head.append(dark1);
+ document.head.append(dark2);
+ document.head.append(dark3);
+ document.head.append(dark4);
+ document.head.append(dark5);
+ document.head.append(dark6);
+ } else {
+ dark1.remove();
+ dark2.remove();
+ dark3.remove();
+ dark4.remove();
+ dark5.remove();
+ dark6.remove();
+ document.head.append(light1);
+ document.head.append(light2);
+ document.head.append(light3);
+ document.head.append(light4);
+ document.head.append(light5);
+ document.head.append(light6);
+ }
+ }
+ matcher = window.matchMedia('(prefers-color-scheme: dark)');
+ matcher.addListener(onUpdate);
+ onUpdate();
+ </script>
+
+ <link href="/css/main.css" rel="stylesheet">
+ <link href="/css/syntax.css" rel="stylesheet">
+ <script src="/javascript/main.js"></script>
+
+ <!-- Matomo -->
+<script>
+ var _paq = window._paq = window._paq || [];
+ /* tracker methods like "setCustomDimension" should be called before
"trackPageView" */
+ /* We explicitly disable cookie tracking to avoid privacy issues */
+ _paq.push(['disableCookies']);
+ _paq.push(['trackPageView']);
+ _paq.push(['enableLinkTracking']);
+ (function() {
+ var u="https://analytics.apache.org/";
+ _paq.push(['setTrackerUrl', u+'matomo.php']);
+ _paq.push(['setSiteId', '20']);
+ var d=document, g=d.createElement('script'),
s=d.getElementsByTagName('script')[0];
+ g.async=true; g.src=u+'matomo.js'; s.parentNode.insertBefore(g,s);
+ })();
+</script>
+<!-- End Matomo Code -->
+
+
+ <link type="application/atom+xml" rel="alternate"
href="https://arrow.apache.org/feed.xml" title="Apache Arrow" />
+ </head>
+
+
+<body class="wrap">
+ <header>
+ <nav class="navbar navbar-expand-md navbar-dark bg-dark">
+
+ <a class="navbar-brand no-padding" href="/"><img
src="/img/arrow-inverse-300px.png" height="40px"></a>
+
+ <button class="navbar-toggler ml-auto" type="button" data-toggle="collapse"
data-target="#arrow-navbar" aria-controls="arrow-navbar" aria-expanded="false"
aria-label="Toggle navigation">
+ <span class="navbar-toggler-icon"></span>
+ </button>
+
+ <!-- Collect the nav links, forms, and other content for toggling -->
+ <div class="collapse navbar-collapse justify-content-end"
id="arrow-navbar">
+ <ul class="nav navbar-nav">
+ <li class="nav-item"><a class="nav-link" href="/overview/"
role="button" aria-haspopup="true" aria-expanded="false">Overview</a></li>
+ <li class="nav-item"><a class="nav-link" href="/faq/" role="button"
aria-haspopup="true" aria-expanded="false">FAQ</a></li>
+ <li class="nav-item"><a class="nav-link" href="/blog" role="button"
aria-haspopup="true" aria-expanded="false">Blog</a></li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
id="navbarDropdownGetArrow" role="button" data-toggle="dropdown"
aria-haspopup="true" aria-expanded="false">
+ Get Arrow
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownGetArrow">
+ <a class="dropdown-item" href="/install/">Install</a>
+ <a class="dropdown-item" href="/release/">Releases</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
id="navbarDropdownDocumentation" role="button" data-toggle="dropdown"
aria-haspopup="true" aria-expanded="false">
+ Docs
+ </a>
+ <div class="dropdown-menu"
aria-labelledby="navbarDropdownDocumentation">
+ <a class="dropdown-item" href="/docs">Project Docs</a>
+ <a class="dropdown-item"
href="/docs/format/Columnar.html">Format</a>
+ <hr>
+ <a class="dropdown-item" href="/docs/c_glib">C GLib</a>
+ <a class="dropdown-item" href="/docs/cpp">C++</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/blob/main/csharp/README.md"
target="_blank" rel="noopener">C#</a>
+ <a class="dropdown-item"
href="https://godoc.org/github.com/apache/arrow/go/arrow" target="_blank"
rel="noopener">Go</a>
+ <a class="dropdown-item" href="/docs/java">Java</a>
+ <a class="dropdown-item" href="/docs/js">JavaScript</a>
+ <a class="dropdown-item" href="/julia/">Julia</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/blob/main/matlab/README.md"
target="_blank" rel="noopener">MATLAB</a>
+ <a class="dropdown-item" href="/docs/python">Python</a>
+ <a class="dropdown-item" href="/docs/r">R</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/blob/main/ruby/README.md" target="_blank"
rel="noopener">Ruby</a>
+ <a class="dropdown-item" href="https://docs.rs/arrow/latest"
target="_blank" rel="noopener">Rust</a>
+ <a class="dropdown-item" href="/swift">Swift</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
id="navbarDropdownSource" role="button" data-toggle="dropdown"
aria-haspopup="true" aria-expanded="false">
+ Source
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownSource">
+ <a class="dropdown-item" href="https://github.com/apache/arrow"
target="_blank" rel="noopener">Main Repo</a>
+ <hr>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/tree/main/c_glib" target="_blank"
rel="noopener">C GLib</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/tree/main/cpp" target="_blank"
rel="noopener">C++</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/tree/main/csharp" target="_blank"
rel="noopener">C#</a>
+ <a class="dropdown-item" href="https://github.com/apache/arrow-go"
target="_blank" rel="noopener">Go</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow-java" target="_blank"
rel="noopener">Java</a>
+ <a class="dropdown-item" href="https://github.com/apache/arrow-js"
target="_blank" rel="noopener">JavaScript</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow-julia" target="_blank"
rel="noopener">Julia</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/tree/main/matlab" target="_blank"
rel="noopener">MATLAB</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/tree/main/python" target="_blank"
rel="noopener">Python</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/tree/main/r" target="_blank"
rel="noopener">R</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/tree/main/ruby" target="_blank"
rel="noopener">Ruby</a>
+ <a class="dropdown-item" href="https://github.com/apache/arrow-rs"
target="_blank" rel="noopener">Rust</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow-swift" target="_blank"
rel="noopener">Swift</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
id="navbarDropdownSubprojects" role="button" data-toggle="dropdown"
aria-haspopup="true" aria-expanded="false">
+ Subprojects
+ </a>
+ <div class="dropdown-menu"
aria-labelledby="navbarDropdownSubprojects">
+ <a class="dropdown-item" href="/adbc">ADBC</a>
+ <a class="dropdown-item" href="/docs/format/Flight.html">Arrow
Flight</a>
+ <a class="dropdown-item" href="/docs/format/FlightSql.html">Arrow
Flight SQL</a>
+ <a class="dropdown-item" href="https://datafusion.apache.org"
target="_blank" rel="noopener">DataFusion</a>
+ <a class="dropdown-item" href="/nanoarrow">nanoarrow</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#"
id="navbarDropdownCommunity" role="button" data-toggle="dropdown"
aria-haspopup="true" aria-expanded="false">
+ Community
+ </a>
+ <div class="dropdown-menu" aria-labelledby="navbarDropdownCommunity">
+ <a class="dropdown-item" href="/community/">Communication</a>
+ <a class="dropdown-item"
href="/docs/developers/index.html">Contributing</a>
+ <a class="dropdown-item"
href="https://github.com/apache/arrow/issues" target="_blank"
rel="noopener">Issue Tracker</a>
+ <a class="dropdown-item" href="/committers/">Governance</a>
+ <a class="dropdown-item" href="/use_cases/">Use Cases</a>
+ <a class="dropdown-item" href="/powered_by/">Powered By</a>
+ <a class="dropdown-item" href="/visual_identity/">Visual
Identity</a>
+ <a class="dropdown-item" href="/security/">Security</a>
+ <a class="dropdown-item"
href="https://www.apache.org/foundation/policies/conduct.html" target="_blank"
rel="noopener">Code of Conduct</a>
+ </div>
+ </li>
+ <li class="nav-item dropdown">
+ <a class="nav-link dropdown-toggle" href="#" id="navbarDropdownASF"
role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
+ ASF Links
+ </a>
+ <div class="dropdown-menu dropdown-menu-right"
aria-labelledby="navbarDropdownASF">
+ <a class="dropdown-item" href="https://www.apache.org/"
target="_blank" rel="noopener">ASF Website</a>
+ <a class="dropdown-item" href="https://www.apache.org/licenses/"
target="_blank" rel="noopener">License</a>
+ <a class="dropdown-item"
href="https://www.apache.org/foundation/sponsorship.html" target="_blank"
rel="noopener">Donate</a>
+ <a class="dropdown-item"
href="https://www.apache.org/foundation/thanks.html" target="_blank"
rel="noopener">Thanks</a>
+ <a class="dropdown-item" href="https://www.apache.org/security/"
target="_blank" rel="noopener">Security</a>
+ </div>
+ </li>
+ </ul>
+ </div>
+<!-- /.navbar-collapse -->
+ </nav>
+
+ </header>
+
+ <div class="container p-4 pt-5">
+ <div class="col-md-8 mx-auto">
+ <main role="main" class="pb-5">
+
+<h1>
+ 3x-9x Faster Apache Parquet Footer Metadata Using a Custom Thrift Parser in
Rust
+</h1>
+<hr class="mt-4 mb-3">
+
+
+
+<p class="mb-4 pb-1">
+ <span class="badge badge-secondary">Published</span>
+ <span class="published mr-3">
+ 23 Oct 2025
+ </span>
+ <br>
+ <span class="badge badge-secondary">By</span>
+
+ <a class="mr-3" href="https://github.com/alamb" target="_blank"
rel="noopener">Andrew Lamb (alamb) </a>
+
+
+
+</p>
+
+
+ <!--
+
+-->
+<p><em>Editor’s Note: While <a href="https://arrow.apache.org/">Apache
Arrow</a> and <a href="https://parquet.apache.org/" target="_blank"
rel="noopener">Apache Parquet</a> are separate projects,
+the Arrow <a href="https://github.com/apache/arrow-rs" target="_blank"
rel="noopener">arrow-rs</a> repository hosts the development of the <a
href="https://crates.io/crates/parquet" target="_blank"
rel="noopener">parquet</a> Rust
+crate, a widely used and high-performance Parquet implementation.</em></p>
+<h2>Summary</h2>
+<p>Version <a href="https://crates.io/crates/parquet/57.0.0" target="_blank"
rel="noopener">57.0.0</a> of the <a href="https://crates.io/crates/parquet"
target="_blank" rel="noopener">parquet</a> Rust crate decodes metadata more
than three times
+faster than previous versions thanks to a new custom <a
href="https://thrift.apache.org/" target="_blank" rel="noopener">Apache
Thrift</a> parser. The new
+parser is faster in all cases and enables further performance
improvements not
+possible with generated parsers, such as skipping unnecessary fields and
selective parsing.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/results.png" width="100%"
class="img-responsive" alt="" aria-hidden="true">
+</div>
+<p><em>Figure 1:</em> Performance comparison of <a
href="https://parquet.apache.org/" target="_blank" rel="noopener">Apache
Parquet</a> metadata parsing using a generated
+Thrift parser (versions <code>56.2.0</code> and earlier) and the new
+<a href="https://github.com/apache/arrow-rs/issues/5854" target="_blank"
rel="noopener">custom Thrift parser</a> in <a
href="https://github.com/apache/arrow-rs" target="_blank"
rel="noopener">arrow-rs</a> version <a
href="https://crates.io/crates/parquet/57.0.0" target="_blank"
rel="noopener">57.0.0</a>. No
+changes are needed to the Parquet format itself.
+See the <a href="https://github.com/alamb/parquet_footer_parsing"
target="_blank" rel="noopener">benchmark page</a> for more details.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/scaling.png" width="100%"
class="img-responsive" alt="Scaling behavior of custom Thrift parser"
aria-hidden="true">
+</div>
+<p><em>Figure 2:</em> Speedup of the custom Thrift decoder for string and
floating-point data types,
+for <code>100</code>, <code>1000</code>, <code>10,000</code>, and
<code>100,000</code> columns. The new parser is faster in all cases,
+and the speedup is similar regardless of the number of columns. See the <a
href="https://github.com/alamb/parquet_footer_parsing" target="_blank"
rel="noopener">benchmark page</a> for more details.</p>
+<h2>Introduction: Parquet and the Importance of Metadata Parsing</h2>
+<p><a href="https://parquet.apache.org/" target="_blank" rel="noopener">Apache
Parquet</a> is a popular columnar storage format
+designed to be efficient for both storage and query processing. Parquet
+files consist of a series of data pages, and a footer, as shown in Figure 3.
The footer
+contains metadata about the file, including schema, statistics, and other
+information needed to decode the data pages.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/parquet.png" width="100%"
class="img-responsive" alt="Physical File Structure of Parquet"
aria-hidden="true">
+</div>
+<p><em>Figure 3:</em> Structure of a Parquet file showing the header, data
pages, and footer metadata.</p>
+<p>Getting information stored in the footer is typically the first step in
reading
+a Parquet file, as it is required to interpret the data pages.
<em>Parsing</em> the
+footer is often performance critical:</p>
+<ul>
+<li>When reading from fast local storage, such as modern NVMe SSDs, footer
parsing
+must be completed to know what data pages to read, placing it directly on the
critical
+I/O path.</li>
+<li>Footer parsing scales linearly with the number of columns and row groups
in a
+Parquet file and thus can be a bottleneck for tables with many columns or files
+with many row groups.</li>
+<li>Even in systems that cache the parsed footer in memory (see <a
href="https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/"
target="_blank" rel="noopener">Using
+External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries
+on Apache Parquet</a>), the footer must still be parsed on cache miss.</li>
+</ul>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/flow.png" width="100%"
class="img-responsive" alt="Typical Parquet processing flow" aria-hidden="true">
+</div>
+<p><em>Figure 4:</em> Typical processing flow for Parquet files for stateless
and stateful
+systems. Stateless engines read the footer on every query, so the time taken to
+parse the footer directly adds to query latency. Stateful systems cache some or
+all of the parsed footer in advance of queries.</p>
+<p>The speed of parsing metadata has grown even more important as Parquet
spreads
+throughout the data ecosystem and is used for more latency-sensitive workloads
such
+as observability, interactive analytics, and single-point
+lookups for Retrieval-Augmented Generation (RAG) applications feeding LLMs.
+As overall query times decrease, the proportion spent on footer parsing
increases.</p>
+<h2>Background: Apache Thrift</h2>
+<p>Parquet stores metadata using <a href="https://thrift.apache.org/"
target="_blank" rel="noopener">Apache Thrift</a>, a framework for
+network data types and service interfaces. It includes a <a
href="https://thrift.apache.org/docs/idl" target="_blank" rel="noopener">data
definition
+language</a> similar to <a
href="https://developers.google.com/protocol-buffers" target="_blank"
rel="noopener">Protocol Buffers</a>. Thrift definition files describe data
+types in a language-neutral way, and systems typically use code generators to
+automatically create code for a specific programming language to read and write
+those data types.</p>
+<p>The <a
href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift"
target="_blank" rel="noopener">parquet.thrift</a> file defines the format of
the metadata
+serialized at the end of each Parquet file in the <a
href="https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md"
target="_blank" rel="noopener">Thrift Compact
+protocol</a>, as shown below in Figure 5. The binary encoding is
"variable-length",
+meaning that the length of each element depends on its content, not
+just its type. Smaller-valued primitive types are encoded in fewer bytes than
+larger values, and strings and lists are stored inline, prefixed with their
+length.</p>
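<p>The two variable-length building blocks just described can be sketched
directly. The following is an illustrative Rust decoder, not code from the
<code>parquet</code> crate: unsigned values are LEB128 varints, and signed
values are first zigzag-mapped to unsigned so that small magnitudes encode
in few bytes.</p>

```rust
// Illustrative decoders for the two variable-length primitives used
// by the Thrift Compact protocol (not the parquet crate's code).

/// Decode an unsigned LEB128 varint, returning (value, bytes consumed).
/// Each byte contributes its low 7 bits; the high bit marks continuation.
fn read_varint(buf: &[u8]) -> (u64, usize) {
    let mut value = 0u64;
    let mut shift = 0;
    for (i, &b) in buf.iter().enumerate() {
        value |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return (value, i + 1);
        }
        shift += 7;
    }
    panic!("truncated varint");
}

/// Undo zigzag encoding: 0 -> 0, 1 -> -1, 2 -> 1, 3 -> -2, ...
fn zigzag_decode(v: u64) -> i64 {
    ((v >> 1) as i64) ^ -((v & 1) as i64)
}

fn main() {
    // 300 takes two bytes: 0xAC carries the low 7 bits plus a
    // continuation flag, 0x02 carries the rest.
    assert_eq!(read_varint(&[0xAC, 0x02]), (300, 2));
    assert_eq!(zigzag_decode(1), -1);
    assert_eq!(zigzag_decode(2), 1);
    println!("varint and zigzag decode ok");
}
```

<p>A small value such as <code>5</code> decodes from a single byte while
larger values grow byte by byte, which is exactly why the encoding cannot
be randomly accessed.</p>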
+<p>This encoding is space-efficient but, due to being variable-length, does not
+support random access: it is not possible to locate a particular field without
+scanning all previous fields. Other formats such as <a
href="https://google.github.io/flatbuffers/" target="_blank"
rel="noopener">FlatBuffers</a> provide
+random-access parsing and have been <a
href="https://lists.apache.org/thread/j9qv5vyg0r4jk6tbm6sqthltly4oztd3"
target="_blank" rel="noopener">proposed as alternatives</a> given their
+theoretical performance advantages. However, changing the Parquet format is a
+significant undertaking, requires buy-in from the community and ecosystem,
+and would likely take years to be adopted.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/thrift-compact-encoding.png"
width="100%" class="img-responsive" alt="Thrift Compact Encoding Illustration"
aria-hidden="true">
+</div>
+<p><em>Figure 5:</em> Parquet metadata is serialized using the <a
href="https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md"
target="_blank" rel="noopener">Thrift Compact protocol</a>.
+Each field is stored using a variable number of bytes that depends on its
value.
+Primitive types use a variable-length encoding and strings and lists are
+prefixed with their lengths.</p>
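<p>For concreteness, the short-form field header shown in Figure 5 packs two
values into one byte. The sketch below illustrates the idea and is not the
crate's parser:</p>

```rust
// Illustrative decode of a compact-protocol short-form field header
// (not the parquet crate's code): the high nibble is the field-id
// delta from the previous field (1-15), the low nibble is the type.

fn read_short_field_header(byte: u8, last_field_id: i16) -> (i16, u8) {
    let delta = (byte >> 4) as i16; // a delta of 0 signals the long form
    let field_type = byte & 0x0f;   // e.g. 5 = i32, 8 = binary/string
    (last_field_id + delta, field_type)
}

fn main() {
    // 0x15 after field id 0: next field id is 1, type i32 (5).
    assert_eq!(read_short_field_header(0x15, 0), (1, 5));
    // 0x28 after field id 1: next field id is 3, type binary (8).
    assert_eq!(read_short_field_header(0x28, 1), (3, 8));
    println!("field header decode ok");
}
```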
+<p>Despite Thrift's very real disadvantage of lacking random access, software
+optimizations are much easier to deploy than format changes. <a
href="https://xiangpeng.systems/" target="_blank" rel="noopener">Xiangpeng
Hao</a>'s
+previous analysis theorized significant (2x–4x) potential performance
+improvements simply by optimizing the implementation of Parquet footer parsing
+(see <a href="https://www.influxdata.com/blog/how-good-parquet-wide-tables/"
target="_blank" rel="noopener">How Good is Parquet for Wide Tables (Machine
Learning
+Workloads) Really?</a> for more details).</p>
+<h2>Processing Thrift Using Generated Parsers</h2>
+<p><em>Parsing</em> Parquet metadata is the process of decoding the
Thrift-encoded bytes
+into in-memory structures that can be used for computation. Most Parquet
+implementations use one of the existing <a
href="https://thrift.apache.org/lib/" target="_blank" rel="noopener">Thrift
compilers</a> to generate a parser
+that converts Thrift binary data into generated code structures, and then copy
+relevant portions of those generated structures into API-level structures.
+For example, the <a
href="https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/src/parquet"
target="_blank" rel="noopener">C/C++ Parquet implementation</a> includes a <a
href="https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/build-support/update-thrift.sh#L23"
target="_blank" rel="noopener">two</a>-<a
href="https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/src/parquet/thrift_internal.h#L56"
targ [...]
+as does <a
href="https://github.com/apache/parquet-java/blob/0fea3e1e22fffb0a25193e3efb9a5d090899458a/parquet-format-structures/pom.xml#L69-L88"
target="_blank" rel="noopener">parquet-java</a>. <a
href="https://github.com/duckdb/duckdb/blob/8f512187537c65d36ce6d6f562b75a37e8d4ee54/third_party/parquet/parquet_types.h#L1-L6"
target="_blank" rel="noopener">DuckDB</a> also contains a Thrift
compiler–generated
+parser.</p>
+<p>In versions <code>56.2.0</code> and earlier, the Apache Arrow Rust
implementation used the
+same pattern. The <a
href="https://docs.rs/parquet/56.2.0/parquet/format/index.html" target="_blank"
rel="noopener">format</a> module contains a parser generated by the <a
href="https://crates.io/crates/thrift" target="_blank" rel="noopener">thrift
+crate</a> and the <a
href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift"
target="_blank" rel="noopener">parquet.thrift</a> definition. Parsing metadata
involves:</p>
+<ol>
+<li>Invoke the generated parser on the Thrift binary data, producing
+generated in-memory structures (e.g., <a
href="https://docs.rs/parquet/56.2.0/parquet/format/struct.FileMetaData.html"
target="_blank" rel="noopener"><code>struct FileMetaData</code></a>), then</li>
+<li>Copy the relevant fields into a more user-friendly representation,
+<a
href="https://docs.rs/parquet/56.2.0/parquet/file/metadata/struct.ParquetMetaData.html"
target="_blank" rel="noopener"><code>ParquetMetadata</code></a>.</li>
+</ol>
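<p>Schematically, the two steps amount to materializing everything and then
copying a subset. The type and field names below are illustrative stand-ins,
not the crate's actual generated API:</p>

```rust
// Schematic of the two-step flow (illustrative names only; the real
// generated structs in parquet <= 56.2.0 are much larger).

/// Step 1 output: a Thrift-compiler-style struct with every field
/// parsed and heap-allocated.
#[allow(dead_code)]
struct GeneratedFileMetaData {
    version: i32,
    num_rows: i64,
    created_by: Option<String>,
    // ... schema elements, row groups, key/value metadata, etc.
}

/// Step 2 output: a friendlier user-facing representation.
struct ApiFileMetaData {
    num_rows: i64,
    created_by: Option<String>,
}

/// Step 2 is a second pass over already-parsed data, copying fields.
fn convert(generated: GeneratedFileMetaData) -> ApiFileMetaData {
    ApiFileMetaData {
        num_rows: generated.num_rows,
        created_by: generated.created_by,
    }
}

fn main() {
    let generated = GeneratedFileMetaData {
        version: 2,
        num_rows: 1_000,
        created_by: Some("parquet-rs".to_string()),
    };
    let api = convert(generated);
    assert_eq!(api.num_rows, 1_000);
    println!("created_by = {:?}", api.created_by);
}
```

<p>The intermediate structures are allocated only to be copied and then
dropped, overhead that a one-step parser can avoid.</p>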
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/original-pipeline.png" width="100%"
class="img-responsive" alt="Original Parquet Parsing Pipeline"
aria-hidden="true">
+</div>
+<p><em>Figure 6:</em> Two-step process to read Parquet metadata: A parser
created with the
+<code>thrift</code> crate and <code>parquet.thrift</code> parses the metadata
bytes
+into generated in-memory structures. These structures are then converted into
+API objects.</p>
+<p>The parsers generated by standard Thrift compilers typically parse
<em>all</em> fields
+in a single pass over the Thrift-encoded bytes, copying data into in-memory,
+heap-allocated structures (e.g., Rust <a
href="https://doc.rust-lang.org/std/vec/struct.Vec.html" target="_blank"
rel="noopener"><code>Vec</code></a>, or C++ <a
href="https://en.cppreference.com/w/cpp/container/vector.html" target="_blank"
rel="noopener"><code>std::vector</code></a>) as shown
+in Figure 7 below.</p>
+<p>Parsing all fields is straightforward and a good default
+choice given Thrift's original design goal of encoding network messages.
+Network messages typically don't contain extra information irrelevant for
receivers;
+however, Parquet metadata often <em>does</em> contain information
+that is not needed for a particular query. In such cases, parsing the entire
+metadata into in-memory structures is wasteful.</p>
+<p>For example, a query on a file with 1,000 columns that reads
+only 10 columns and has a single column predicate
+(e.g., <code>time > now() - '1 minute'</code>) only needs</p>
+<ol>
+<li>
+<a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912"
target="_blank" rel="noopener"><code>Statistics</code></a> (or <a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163"
target="_blank" rel="noopener"><code>ColumnIndex</code></a>) for the
<code>time</code> column</li>
+<li>
+<a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L958"
target="_blank" rel="noopener"><code>ColumnChunk</code></a> information for
the 10 selected columns</li>
+</ol>
+<p>The default strategy to parse (allocating and copying) all statistics and
all
+<code>ColumnChunks</code> results in creating 999 more statistics and 990 more
<code>ColumnChunks</code>
+than necessary. As discussed above, given the
+variable encoding used for the metadata, all metadata bytes must still be
+fetched and scanned; however, CPUs are (very) fast at scanning data, and
+skipping <em>parsing</em> of unneeded fields speeds up overall metadata
performance
+significantly.</p>
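<p>The difference between parsing and skipping can be sketched for a
Thrift-style length-prefixed binary field (a varint length followed by the
payload). This illustrates the idea rather than the crate's implementation:
skipping reads only the length header and advances the cursor, while parsing
heap-allocates a copy.</p>

```rust
// Illustrative contrast between materializing and skipping a
// Thrift-style length-prefixed binary field (not the crate's code).

/// Decode an unsigned LEB128 varint, returning (value, bytes consumed).
fn read_varint(buf: &[u8]) -> (u64, usize) {
    let mut value = 0u64;
    let mut shift = 0;
    for (i, &b) in buf.iter().enumerate() {
        value |= u64::from(b & 0x7f) << shift;
        if b & 0x80 == 0 {
            return (value, i + 1);
        }
        shift += 7;
    }
    panic!("truncated varint");
}

/// Parse the field: heap-allocates a copy of the payload.
fn parse_binary(buf: &[u8]) -> (Vec<u8>, usize) {
    let (len, header) = read_varint(buf);
    let end = header + len as usize;
    (buf[header..end].to_vec(), end)
}

/// Skip the field: compute how far to advance, with no allocation.
fn skip_binary(buf: &[u8]) -> usize {
    let (len, header) = read_varint(buf);
    header + len as usize
}

fn main() {
    let field = [3, b'f', b'o', b'o', 0x99]; // "foo", then later bytes
    let (payload, consumed) = parse_binary(&field);
    assert_eq!(payload, b"foo".to_vec());
    assert_eq!(consumed, 4);
    // Skipping advances the cursor identically, without the copy.
    assert_eq!(skip_binary(&field), 4);
    println!("skip and parse agree on cursor position");
}
```

<p>Both functions must still read the length header, mirroring the point
above: the bytes are scanned either way, but only parsing pays for
allocation and copying.</p>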
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/thrift-parsing-allocations.png"
width="100%" class="img-responsive" alt="Thrift Parsing Allocations"
aria-hidden="true">
+</div>
+<p><em>Figure 7:</em> Generated Thrift parsers typically parse encoded bytes
into
+structures requiring many small heap allocations, which are expensive.</p>
+<h2>New Design: Custom Thrift Parser</h2>
+<p>As is typical of generated code, opportunities for specializing
+the behavior of generated Thrift parsers are limited:</p>
+<ol>
+<li>It is not easy to modify (it is re-generated from the
+Thrift definitions when they change and carries the warning
+<code>/* DO NOT EDIT UNLESS YOU ARE SURE THAT YOU KNOW WHAT YOU ARE DOING
*/</code>).</li>
+<li>It typically maps one-to-one with Thrift definitions, limiting
+additional optimizations such as zero-copy parsing, field
+skipping, and amortized memory allocation strategies.</li>
+<li>Its API is very stable (hard to change), which is important for easy
maintenance when a large number
+of projects are built using the <a href="https://crates.io/crates/thrift"
target="_blank" rel="noopener">thrift crate</a>. For example, the
+<a href="https://crates.io/crates/thrift/0.17.0" target="_blank"
rel="noopener">last release of the Rust <code>thrift</code> crate</a> was
almost three years ago at
+the time of this writing.</li>
+</ol>
+<p>These limitations are a consequence of the Thrift project's design goals:
general purpose
+code that is easy to embed in a wide variety of other projects, rather than
+any fundamental limitation of the Thrift format.
+Given our goal of fast Parquet metadata parsing, we needed
+a custom, easier-to-optimize parser that converts Thrift binary directly into
the needed
+structures (Figure 8). Since arrow-rs already postprocessed the generated code
+and included a custom implementation of the compact protocol API, this change
+to a completely custom parser was a natural next step.</p>
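<p>As a sketch of what converting Thrift binary "directly into the needed structures" means (hypothetical field numbers and types, not the real <code>parquet.thrift</code> layout): a compact-protocol struct reader walks field headers, decodes the fields it wants straight into the output, and scans past the rest.</p>

```rust
/// Read a ULEB128 varint (Thrift compact protocol).
fn varint(buf: &[u8], pos: &mut usize) -> u64 {
    let (mut v, mut s) = (0u64, 0u32);
    loop {
        let b = buf[*pos];
        *pos += 1;
        v |= u64::from(b & 0x7f) << s;
        if b & 0x80 == 0 {
            return v;
        }
        s += 7;
    }
}

/// Zigzag-decode a varint into a signed integer.
fn zigzag(buf: &[u8], pos: &mut usize) -> i64 {
    let u = varint(buf, pos);
    ((u >> 1) as i64) ^ -((u & 1) as i64)
}

/// One-step reader for a toy struct with `version` at field 1 and
/// `num_rows` at field 3 (short-form headers, varint-typed fields only).
fn parse_toy_struct(buf: &[u8]) -> (i32, i64) {
    let (mut pos, mut last_id) = (0usize, 0i16);
    let (mut version, mut num_rows) = (0i32, 0i64);
    loop {
        let header = buf[pos];
        pos += 1;
        if header == 0 {
            break; // STOP byte: end of struct
        }
        last_id += (header >> 4) as i16; // high nibble: field-id delta
        match last_id {
            1 => version = zigzag(buf, &mut pos) as i32,
            3 => num_rows = zigzag(buf, &mut pos),
            _ => {
                varint(buf, &mut pos); // unneeded field: scan past it
            }
        }
    }
    (version, num_rows)
}
```

<p>The wanted values land directly in the caller's variables; no intermediate generated struct is ever materialized.</p>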
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/new-pipeline.png" width="100%"
class="img-responsive" alt="New Parquet Parsing Pipeline" aria-hidden="true">
+</div>
+<p><em>Figure 8:</em> One-step Parquet metadata parsing using a custom Thrift
parser. The
+Thrift binary is parsed directly into the desired in-memory representation with
+highly optimized code.</p>
+<p>Our new custom parser is optimized for the specific subset of Thrift used by
+Parquet and contains various performance optimizations, such as careful
+memory allocation. The largest initial speedup came from removing
+intermediate structures and directly creating the needed in-memory
representation.
+We also carefully hand-optimized several performance-critical code paths (see
<a href="https://github.com/apache/arrow-rs/pull/8574" target="_blank"
rel="noopener">#8574</a>,
+<a href="https://github.com/apache/arrow-rs/pull/8587" target="_blank"
rel="noopener">#8587</a>, and <a
href="https://github.com/apache/arrow-rs/pull/8599" target="_blank"
rel="noopener">#8599</a>).</p>
+<h3>Maintainability</h3>
+<p>The largest concern with a custom parser is that it is more difficult
+to maintain than generated parsers because the custom parser must be updated to
+reflect any changes to <a
href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift"
target="_blank" rel="noopener">parquet.thrift</a>. This is a growing concern
given the
+resurgent interest in Parquet and the recent addition of new features such as
+<a href="https://github.com/apache/parquet-format/blob/master/Geospatial.md"
target="_blank" rel="noopener">Geospatial</a> and <a
href="https://github.com/apache/parquet-format/blob/master/VariantEncoding.md"
target="_blank" rel="noopener">Variant</a> types.</p>
+<p>Thankfully, after discussions with the community, <a
href="https://github.com/jhorstmann" target="_blank" rel="noopener">Jörn
Horstmann</a> developed
+a <a href="https://github.com/jhorstmann/compact-thrift" target="_blank"
rel="noopener">Rust macro-based approach</a> for generating code with annotated
Rust structs
+that closely resemble the Thrift definitions while permitting additional hand
+optimization where necessary. This approach is similar to the <a
href="https://serde.rs/" target="_blank" rel="noopener">serde</a> crate
+where generic implementations can be generated with <code>#[derive]</code>
annotations and
+specialized serialization is written by hand where needed. <a
href="https://github.com/etseidl" target="_blank" rel="noopener">Ed Seidl</a>
then
+rewrote the metadata parsing code in the <a
href="https://crates.io/crates/parquet" target="_blank"
rel="noopener">parquet</a> crate using these macros.
+Please see the <a href="https://github.com/apache/arrow-rs/pull/8530"
target="_blank" rel="noopener">final PR</a> for details of the level of effort
involved.</p>
+<p>For example, here is the original Thrift definition of the <a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1254C1-L1314C2"
target="_blank" rel="noopener"><code>FileMetaData</code></a> structure
(comments omitted for brevity):</p>
+<div class="language-thrift highlighter-rouge"><div class="highlight"><pre
class="highlight"><code data-lang="thrift">struct FileMetaData {
+ 1: required i32 version
+ 2: required list<SchemaElement> schema;
+ 3: required i64 num_rows
+ 4: required list<RowGroup> row_groups
+ 5: optional list<KeyValue> key_value_metadata
+ 6: optional string created_by
+ 7: optional list<ColumnOrder> column_orders;
+ 8: optional EncryptionAlgorithm encryption_algorithm
+ 9: optional binary footer_signing_key_metadata
+}
+</code></pre></div></div>
+<p>And here (<a
href="https://github.com/apache/arrow-rs/blob/02fa779a9cb122c5218293be3afb980832701683/parquet/src/file/metadata/thrift_gen.rs#L146-L158"
target="_blank" rel="noopener">source</a>) is the corresponding Rust structure
using the Thrift macros (before Ed wrote a custom version in <a
href="https://github.com/apache/arrow-rs/pull/8574" target="_blank"
rel="noopener">#8574</a>):</p>
+<div class="language-rust highlighter-rouge"><div class="highlight"><pre
class="highlight"><code data-lang="rust"><span
class="nd">thrift_struct!</span><span class="p">(</span>
+<span class="k">struct</span> <span class="n">FileMetaData</span><span
class="o"><</span><span class="nv">'a</span><span class="o">></span>
<span class="p">{</span>
+<span class="mi">1</span><span class="p">:</span> <span
class="n">required</span> <span class="nb">i32</span> <span
class="n">version</span>
+<span class="mi">2</span><span class="p">:</span> <span
class="n">required</span> <span class="n">list</span><span
class="o"><</span><span class="nv">'a</span><span
class="o">><</span><span class="n">SchemaElement</span><span
class="o">></span> <span class="n">schema</span><span class="p">;</span>
+<span class="mi">3</span><span class="p">:</span> <span
class="n">required</span> <span class="nb">i64</span> <span
class="n">num_rows</span>
+<span class="mi">4</span><span class="p">:</span> <span
class="n">required</span> <span class="n">list</span><span
class="o"><</span><span class="nv">'a</span><span
class="o">><</span><span class="n">RowGroup</span><span
class="o">></span> <span class="n">row_groups</span>
+<span class="mi">5</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">list</span><span
class="o"><</span><span class="n">KeyValue</span><span class="o">></span>
<span class="n">key_value_metadata</span>
+<span class="mi">6</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">string</span><span
class="o"><</span><span class="nv">'a</span><span class="o">></span>
<span class="n">created_by</span>
+<span class="mi">7</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">list</span><span
class="o"><</span><span class="n">ColumnOrder</span><span
class="o">></span> <span class="n">column_orders</span><span
class="p">;</span>
+<span class="mi">8</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">EncryptionAlgorithm</span> <span
class="n">encryption_algorithm</span>
+<span class="mi">9</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">binary</span><span
class="o"><</span><span class="nv">'a</span><span class="o">></span>
<span class="n">footer_signing_key_metadata</span>
+<span class="p">}</span>
+<span class="p">);</span>
+</code></pre></div></div>
+<p>This system makes it easy to see the correspondence between the Thrift
+definition and the Rust structure, and it is straightforward to support newly
added
+features such as <code>GeospatialStatistics</code>. The carefully
+hand-optimized parsers for the most performance-critical structures, such as
+<code>RowGroupMetaData</code> and <code>ColumnChunkMetaData</code>, are
harder—though still
+straightforward—to update (see <a
href="https://github.com/apache/arrow-rs/pull/8587" target="_blank"
rel="noopener">#8587</a>). However, those structures are also less
+likely to change frequently.</p>
+<h3>Future Improvements</h3>
+<p>With the custom parser in place, we are working on additional
improvements:</p>
+<ul>
+<li>Implementing special "skip" indexes that jump directly to the parts of the
metadata
+that are needed for a particular query, such as the row group offsets.</li>
+<li>Selectively decoding only the statistics for columns that are needed for a
particular query.</li>
+<li>Potentially contributing the macros back to the thrift crate.</li>
+</ul>
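<p>As a sketch of the skip-index idea (purely hypothetical; the actual design in arrow-rs is still evolving), the parser could record the byte range of each encoded row group during an initial scan, so that later passes decode only the row groups a query touches:</p>

```rust
/// Hypothetical skip index: the byte range each encoded RowGroup occupies
/// inside the footer, recorded during a single initial scan.
struct RowGroupSkipIndex {
    /// (start, end) offsets into the footer bytes, one per row group.
    ranges: Vec<(usize, usize)>,
}

impl RowGroupSkipIndex {
    /// Return the raw bytes of one row group; the other row groups are
    /// never revisited, let alone parsed.
    fn row_group_bytes<'a>(&self, footer: &'a [u8], i: usize) -> &'a [u8] {
        let (start, end) = self.ranges[i];
        &footer[start..end]
    }
}
```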
+<h3>Conclusion</h3>
+<p>We believe metadata parsing in many open source Parquet
+readers is slow primarily because they use parsers automatically generated by
Thrift
+compilers, which are not optimized for Parquet metadata parsing. By writing a
+custom parser, we significantly sped up metadata parsing in the
+<a href="https://crates.io/crates/parquet" target="_blank"
rel="noopener">parquet</a> Rust crate, which is widely used in the <a
href="https://arrow.apache.org/">Apache Arrow</a> ecosystem.</p>
+<p>While this is not the first open source custom Thrift parser for Parquet
+metadata (<a
href="https://github.com/rapidsai/cudf/blob/branch-25.12/cpp/src/io/parquet/compact_protocol_reader.hpp"
target="_blank" rel="noopener">CUDF has had one</a> for many years), we hope
that our results will
+encourage additional Parquet implementations to consider similar optimizations.
+The approach and optimizations we describe in this post are likely applicable
to
+Parquet implementations in other languages, such as C++ and Java.</p>
+<p>Previously, efforts like this were only possible at well-financed commercial
+enterprises. On behalf of the arrow-rs and Parquet contributors, we are excited
+to share this technology with the community in the upcoming <a
href="https://crates.io/crates/parquet/57.0.0" target="_blank"
rel="noopener">57.0.0</a> release and
+invite you to <a
href="https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md"
target="_blank" rel="noopener">come join us</a> and help make it even
better!</p>
+
+ </main>
+ </div>
+
+ <hr>
+<footer class="footer">
+ <div class="row">
+ <div class="col-md-9">
+ <p>Apache Arrow, Arrow, Apache, the Apache logo, and the Apache Arrow
project logo are either registered trademarks or trademarks of The Apache
Software Foundation in the United States and other countries.</p>
+ <p>© 2016-2025 The Apache Software Foundation</p>
+ </div>
+ <div class="col-md-3">
+ <a class="d-sm-none d-md-inline pr-2"
href="https://www.apache.org/events/current-event.html" target="_blank"
rel="noopener">
+        <img src="https://www.apache.org/events/current-event-234x60.png" alt="Current Apache event">
+ </a>
+ </div>
+ </div>
+</footer>
+
+ </div>
+</body>
+</html>
diff --git a/blog/index.html b/blog/index.html
index b03f729b07b..153a7eb0326 100644
--- a/blog/index.html
+++ b/blog/index.html
@@ -248,6 +248,31 @@
+ <p>
+ </p>
+<h3>
+ <a href="/blog/2025/10/23/rust-parquet-metadata/">3x-9x Faster Apache
Parquet Footer Metadata Using a Custom Thrift Parser in Rust</a>
+ </h3>
+
+ <p>
+ <span class="blog-list-date">
+ 23 October 2025
+ </span>
+ </p>
+
+Editor’s Note: While Apache Arrow and Apache Parquet are separate projects,
+the Arrow arrow-rs repository hosts the development of the parquet Rust
+crate, a widely used and high-performance Parquet implementation.
+Summary
+Version 57.0.0 of the parquet Rust crate decodes metadata more than three times
+faster than previous versions thanks to a ne...
+
+ <a href="/blog/2025/10/23/rust-parquet-metadata/">Read More →</a>
+
+
+
+
+
<p>
</p>
<h3>
diff --git a/feed.xml b/feed.xml
index 9768b162ffe..080eb8b8f86 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,267 @@
-<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.4.1">Jekyll</generator><link
href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml"
/><link href="https://arrow.apache.org/" rel="alternate" type="text/html"
/><updated>2025-10-12T16:26:45-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title
type="html">Apache Arrow</title><subtitle>Apache Arrow is the universal
columnar fo [...]
+<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.4.1">Jekyll</generator><link
href="https://arrow.apache.org/feed.xml" rel="self" type="application/atom+xml"
/><link href="https://arrow.apache.org/" rel="alternate" type="text/html"
/><updated>2025-10-23T12:22:18-04:00</updated><id>https://arrow.apache.org/feed.xml</id><title
type="html">Apache Arrow</title><subtitle>Apache Arrow is the universal
columnar fo [...]
+
+-->
+<p><em>Editor’s Note: While <a href="https://arrow.apache.org/">Apache
Arrow</a> and <a href="https://parquet.apache.org/">Apache Parquet</a> are
separate projects,
+the Arrow <a href="https://github.com/apache/arrow-rs">arrow-rs</a> repository
hosts the development of the <a
href="https://crates.io/crates/parquet">parquet</a> Rust
+crate, a widely used and high-performance Parquet implementation.</em></p>
+<h2>Summary</h2>
+<p>Version <a href="https://crates.io/crates/parquet/57.0.0">57.0.0</a> of the
<a href="https://crates.io/crates/parquet">parquet</a> Rust crate decodes
metadata more than three times
+faster than previous versions thanks to a new custom <a
href="https://thrift.apache.org/">Apache Thrift</a> parser. The new
+parser is both faster in all cases and enables further performance
improvements not
+possible with generated parsers, such as skipping unnecessary fields and
selective parsing.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/results.png" width="100%"
class="img-responsive" alt="" aria-hidden="true">
+</div>
+<p><em>Figure 1:</em> Performance comparison of <a
href="https://parquet.apache.org/">Apache Parquet</a> metadata parsing using a
generated
+Thrift parser (versions <code>56.2.0</code> and earlier) and the new
+<a href="https://github.com/apache/arrow-rs/issues/5854">custom Thrift
parser</a> in <a href="https://github.com/apache/arrow-rs">arrow-rs</a> version
<a href="https://crates.io/crates/parquet/57.0.0">57.0.0</a>. No
+changes are needed to the Parquet format itself.
+See the <a href="https://github.com/alamb/parquet_footer_parsing">benchmark
page</a> for more details.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/scaling.png" width="100%"
class="img-responsive" alt="Scaling behavior of custom Thrift parser"
aria-hidden="true">
+</div>
+<p><em>Figure 2:</em> Speedup of the custom Thrift decoder for string and
floating-point data types,
+for <code>100</code>, <code>1000</code>, <code>10,000</code>, and
<code>100,000</code> columns. The new parser is faster in all cases,
+and the speedup is similar regardless of the number of columns. See the <a
href="https://github.com/alamb/parquet_footer_parsing">benchmark page</a> for
more details.</p>
+<h2>Introduction: Parquet and the Importance of Metadata Parsing</h2>
+<p><a href="https://parquet.apache.org/">Apache Parquet</a> is a popular
columnar storage format
+designed to be efficient for both storage and query processing. Parquet
+files consist of a series of data pages, and a footer, as shown in Figure 3.
The footer
+contains metadata about the file, including schema, statistics, and other
+information needed to decode the data pages.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/parquet.png" width="100%"
class="img-responsive" alt="Physical File Structure of Parquet"
aria-hidden="true">
+</div>
+<p><em>Figure 3:</em> Structure of a Parquet file showing the header, data
pages, and footer metadata.</p>
+<p>Getting information stored in the footer is typically the first step in
reading
+a Parquet file, as it is required to interpret the data pages.
<em>Parsing</em> the
+footer is often performance critical:</p>
+<ul>
+<li>When reading from fast local storage, such as modern NVMe SSDs, footer
parsing
+must complete before the reader knows which data pages to fetch, placing it directly on the
critical
+I/O path.</li>
+<li>Footer parsing scales linearly with the number of columns and row groups
in a
+Parquet file and thus can be a bottleneck for tables with many columns or files
+with many row groups.</li>
+<li>Even in systems that cache the parsed footer in memory (see <a
href="https://datafusion.apache.org/blog/2025/08/15/external-parquet-indexes/">Using
+External Indexes, Metadata Stores, Catalogs and Caches to Accelerate Queries
+on Apache Parquet</a>), the footer must still be parsed on cache miss.</li>
+</ul>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/flow.png" width="100%"
class="img-responsive" alt="Typical Parquet processing flow" aria-hidden="true">
+</div>
+<p><em>Figure 4:</em> Typical processing flow for Parquet files for stateless
and stateful
+systems. Stateless engines read the footer on every query, so the time taken to
+parse the footer directly adds to query latency. Stateful systems cache some or
+all of the parsed footer in advance of queries.</p>
+<p>The speed of parsing metadata has grown even more important as Parquet
spreads
+throughout the data ecosystem and is used for more latency-sensitive workloads
such
+as observability, interactive analytics, and single-point
+lookups for Retrieval-Augmented Generation (RAG) applications feeding LLMs.
+As overall query times decrease, the proportion spent on footer parsing
increases.</p>
+<h2>Background: Apache Thrift</h2>
+<p>Parquet stores metadata using <a href="https://thrift.apache.org/">Apache
Thrift</a>, a framework for
+serializing data types and defining service interfaces. It includes a <a
href="https://thrift.apache.org/docs/idl">data definition
+language</a> similar to <a
href="https://developers.google.com/protocol-buffers">Protocol Buffers</a>.
Thrift definition files describe data
+types in a language-neutral way, and systems typically use code generators to
+automatically create code for a specific programming language to read and write
+those data types.</p>
+<p>The <a
href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift">parquet.thrift</a>
file defines the format of the metadata
+serialized at the end of each Parquet file in the <a
href="https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md">Thrift
Compact
+protocol</a>, as shown below in Figure 5. The binary encoding is
"variable-length",
+meaning that the length of each element depends on its content, not
+just its type. Smaller-valued primitive types are encoded in fewer bytes than
+larger values, and strings and lists are stored inline, prefixed with their
+length.</p>
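<p>For example (an illustrative sketch of the standard zigzag-plus-varint rules the compact protocol follows), an <code>i64</code> holding a small value occupies a single byte on the wire, while larger magnitudes take progressively more:</p>

```rust
/// Zigzag-encode a signed integer so that small magnitudes (positive or
/// negative) map to small unsigned values.
fn zigzag_encode(v: i64) -> u64 {
    ((v << 1) ^ (v >> 63)) as u64
}

/// Number of bytes a value occupies as a ULEB128 varint
/// (7 payload bits per byte, high bit set on all but the last byte).
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}
```

<p>This is why the encoded length of a field cannot be known from its declared type alone: it depends on the value actually stored.</p>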
+<p>This encoding is space-efficient but, due to being variable-length, does not
+support random access: it is not possible to locate a particular field without
+scanning all previous fields. Other formats such as <a
href="https://google.github.io/flatbuffers/">FlatBuffers</a> provide
+random-access parsing and have been <a
href="https://lists.apache.org/thread/j9qv5vyg0r4jk6tbm6sqthltly4oztd3">proposed
as alternatives</a> given their
+theoretical performance advantages. However, changing the Parquet format is a
+significant undertaking, requires buy-in from the community and ecosystem,
+and would likely take years to be adopted.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/thrift-compact-encoding.png"
width="100%" class="img-responsive" alt="Thrift Compact Encoding Illustration"
aria-hidden="true">
+</div>
+<p><em>Figure 5:</em> Parquet metadata is serialized using the <a
href="https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md">Thrift
Compact protocol</a>.
+Each field is stored using a variable number of bytes that depends on its
value.
+Primitive types use a variable-length encoding and strings and lists are
+prefixed with their lengths.</p>
+<p>Despite Thrift's very real disadvantage due to lack of random access,
software
+optimizations are much easier to deploy than format changes. <a
href="https://xiangpeng.systems/">Xiangpeng Hao</a>'s
+previous analysis theorized significant (2x–4x) potential performance
+improvements simply by optimizing the implementation of Parquet footer parsing
+(see <a
href="https://www.influxdata.com/blog/how-good-parquet-wide-tables/">How Good
is Parquet for Wide Tables (Machine Learning
+Workloads) Really?</a> for more details).</p>
+<h2>Processing Thrift Using Generated Parsers</h2>
+<p><em>Parsing</em> Parquet metadata is the process of decoding the
Thrift-encoded bytes
+into in-memory structures that can be used for computation. Most Parquet
+implementations use one of the existing <a
href="https://thrift.apache.org/lib/">Thrift compilers</a> to generate a parser
+that converts Thrift binary data into generated code structures, and then copy
+relevant portions of those generated structures into API-level structures.
+For example, the <a
href="https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/src/parquet">C/C++
Parquet implementation</a> includes a <a
href="https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/build-support/update-thrift.sh#L23">two</a>-<a
href="https://github.com/apache/arrow/blob/e1f727cbb447d2385949a54d8f4be2fdc6cefe29/cpp/src/parquet/thrift_internal.h#L56">step</a>
process,
+as does <a
href="https://github.com/apache/parquet-java/blob/0fea3e1e22fffb0a25193e3efb9a5d090899458a/parquet-format-structures/pom.xml#L69-L88">parquet-java</a>.
<a
href="https://github.com/duckdb/duckdb/blob/8f512187537c65d36ce6d6f562b75a37e8d4ee54/third_party/parquet/parquet_types.h#L1-L6">DuckDB</a>
also contains a Thrift compiler–generated
+parser.</p>
+<p>In versions <code>56.2.0</code> and earlier, the Apache Arrow Rust
implementation used the
+same pattern. The <a
href="https://docs.rs/parquet/56.2.0/parquet/format/index.html">format</a>
module contains a parser generated by the <a
href="https://crates.io/crates/thrift">thrift
+crate</a> and the <a
href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift">parquet.thrift</a>
definition. Parsing metadata involves:</p>
+<ol>
+<li>Invoking the generated parser on the Thrift binary data, producing
+generated in-memory structures (e.g., <a
href="https://docs.rs/parquet/56.2.0/parquet/format/struct.FileMetaData.html"><code>struct
FileMetaData</code></a>), then</li>
+<li>Copying the relevant fields into a more user-friendly representation,
+<a
href="https://docs.rs/parquet/56.2.0/parquet/file/metadata/struct.ParquetMetaData.html"><code>ParquetMetaData</code></a>.</li>
+</ol>
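<p>Conceptually, with drastically simplified, hypothetical field sets (the real generated and API structs have many more fields), the two steps look like:</p>

```rust
/// Step 1 output: the struct produced by the generated Thrift parser,
/// fully materialized with owned (heap-allocated) fields.
struct GeneratedFileMetaData {
    version: i32,
    num_rows: i64,
    created_by: Option<String>,
}

/// Step 2 output: the user-facing API type.
#[derive(Debug, PartialEq)]
struct ApiFileMetaData {
    version: i32,
    num_rows: i64,
    created_by: Option<String>,
}

/// Step 2: copy or move the relevant fields across. Every field was
/// already allocated once in step 1, even if the caller never reads it.
impl From<GeneratedFileMetaData> for ApiFileMetaData {
    fn from(g: GeneratedFileMetaData) -> Self {
        ApiFileMetaData {
            version: g.version,
            num_rows: g.num_rows,
            created_by: g.created_by,
        }
    }
}
```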
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/original-pipeline.png" width="100%"
class="img-responsive" alt="Original Parquet Parsing Pipeline"
aria-hidden="true">
+</div>
+<p><em>Figure 6:</em> Two-step process to read Parquet metadata: A parser
created with the
+<code>thrift</code> crate and <code>parquet.thrift</code> parses the metadata
bytes
+into generated in-memory structures. These structures are then converted into
+API objects.</p>
+<p>The parsers generated by standard Thrift compilers typically parse
<em>all</em> fields
+in a single pass over the Thrift-encoded bytes, copying data into in-memory,
+heap-allocated structures (e.g., Rust <a
href="https://doc.rust-lang.org/std/vec/struct.Vec.html"><code>Vec</code></a>,
or C++ <a
href="https://en.cppreference.com/w/cpp/container/vector.html"><code>std::vector</code></a>)
as shown
+in Figure 7 below.</p>
+<p>Parsing all fields is straightforward and a good default
+choice given Thrift's original design goal of encoding network messages.
+Network messages typically don't contain extra information irrelevant for
receivers;
+however, Parquet metadata often <em>does</em> contain information
+that is not needed for a particular query. In such cases, parsing the entire
+metadata into in-memory structures is wasteful.</p>
+<p>For example, a query on a file with 1,000 columns that reads
+only 10 columns and has a single column predicate
+(e.g., <code>time > now() - '1 minute'</code>) only needs</p>
+<ol>
+<li><a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L912"><code>Statistics</code></a>
(or <a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1163"><code>ColumnIndex</code></a>)
for the <code>time</code> column</li>
+<li><a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L958"><code>ColumnChunk</code></a>
information for the 10 selected columns</li>
+</ol>
+<p>The default strategy of parsing (allocating and copying) all statistics and
all
+<code>ColumnChunks</code> thus creates 999 more statistics and 990 more
<code>ColumnChunks</code>
+than necessary. As discussed above, given the
+variable encoding used for the metadata, all metadata bytes must still be
+fetched and scanned; however, CPUs are (very) fast at scanning data, and
+skipping <em>parsing</em> of unneeded fields speeds up overall metadata
performance
+significantly.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/thrift-parsing-allocations.png"
width="100%" class="img-responsive" alt="Thrift Parsing Allocations"
aria-hidden="true">
+</div>
+<p><em>Figure 7:</em> Generated Thrift parsers typically parse encoded bytes
into
+structures requiring many small heap allocations, which are expensive.</p>
+<h2>New Design: Custom Thrift Parser</h2>
+<p>As is typical of generated code, opportunities for specializing
+the behavior of generated Thrift parsers are limited:</p>
+<ol>
+<li>It is not easy to modify (it is re-generated from the
+Thrift definitions when they change and carries the warning
+<code>/* DO NOT EDIT UNLESS YOU ARE SURE THAT YOU KNOW WHAT YOU ARE DOING
*/</code>).</li>
+<li>It typically maps one-to-one with Thrift definitions, limiting
+additional optimizations such as zero-copy parsing, field
+skipping, and amortized memory allocation strategies.</li>
+<li>Its API is very stable (hard to change), which is important for easy
maintenance when a large number
+of projects are built using the <a
href="https://crates.io/crates/thrift">thrift crate</a>. For example, the
+<a href="https://crates.io/crates/thrift/0.17.0">last release of the Rust
<code>thrift</code> crate</a> was almost three years ago at
+the time of this writing.</li>
+</ol>
+<p>These limitations are a consequence of the Thrift project's design goals:
general purpose
+code that is easy to embed in a wide variety of other projects, rather than
+any fundamental limitation of the Thrift format.
+Given our goal of fast Parquet metadata parsing, we needed
+a custom, easier-to-optimize parser that converts Thrift binary directly into
the needed
+structures (Figure 8). Since arrow-rs already postprocessed the generated code
+and included a custom implementation of the compact protocol API, this change
+to a completely custom parser was a natural next step.</p>
+<!-- Image source:
https://docs.google.com/presentation/d/1WjX4t7YVj2kY14SqCpenGqNl_swjdHvPg86UeBT3IcY
-->
+<div style="display: flex; gap: 16px; justify-content: center; align-items:
flex-start;">
+ <img src="/img/rust-parquet-metadata/new-pipeline.png" width="100%"
class="img-responsive" alt="New Parquet Parsing Pipeline" aria-hidden="true">
+</div>
+<p><em>Figure 8:</em> One-step Parquet metadata parsing using a custom Thrift
parser. The
+Thrift binary is parsed directly into the desired in-memory representation with
+highly optimized code.</p>
+<p>Our new custom parser is optimized for the specific subset of Thrift used by
+Parquet and contains various performance optimizations, such as careful
+memory allocation. The largest initial speedup came from removing
+intermediate structures and directly creating the needed in-memory
representation.
+We also carefully hand-optimized several performance-critical code paths (see
<a href="https://github.com/apache/arrow-rs/pull/8574">#8574</a>,
+<a href="https://github.com/apache/arrow-rs/pull/8587">#8587</a>, and <a
href="https://github.com/apache/arrow-rs/pull/8599">#8599</a>).</p>
+<h3>Maintainability</h3>
+<p>The largest concern with a custom parser is that it is more difficult
+to maintain than generated parsers because the custom parser must be updated to
+reflect any changes to <a
href="https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift">parquet.thrift</a>.
This is a growing concern given the
+resurgent interest in Parquet and the recent addition of new features such as
+<a
href="https://github.com/apache/parquet-format/blob/master/Geospatial.md">Geospatial</a>
and <a
href="https://github.com/apache/parquet-format/blob/master/VariantEncoding.md">Variant</a>
types.</p>
+<p>Thankfully, after discussions with the community, <a
href="https://github.com/jhorstmann">Jörn Horstmann</a> developed
+a <a href="https://github.com/jhorstmann/compact-thrift">Rust macro-based
approach</a> for generating code with annotated Rust structs
+that closely resemble the Thrift definitions while permitting additional hand
+optimization where necessary. This approach is similar to the <a
href="https://serde.rs/">serde</a> crate
+where generic implementations can be generated with <code>#[derive]</code>
annotations and
+specialized serialization is written by hand where needed. <a
href="https://github.com/etseidl">Ed Seidl</a> then
+rewrote the metadata parsing code in the <a
href="https://crates.io/crates/parquet">parquet</a> crate using these macros.
+Please see the <a href="https://github.com/apache/arrow-rs/pull/8530">final
PR</a> for details of the level of effort involved.</p>
+<p>For example, here is the original Thrift definition of the <a
href="https://github.com/apache/parquet-format/blob/9fd57b59e0ce1a82a69237dcf8977d3e72a2965d/src/main/thrift/parquet.thrift#L1254C1-L1314C2"><code>FileMetaData</code></a>
structure (comments omitted for brevity):</p>
+<div class="language-thrift highlighter-rouge"><div class="highlight"><pre
class="highlight"><code data-lang="thrift">struct FileMetaData {
+ 1: required i32 version
+ 2: required list<SchemaElement> schema;
+ 3: required i64 num_rows
+ 4: required list<RowGroup> row_groups
+ 5: optional list<KeyValue> key_value_metadata
+ 6: optional string created_by
+ 7: optional list<ColumnOrder> column_orders;
+ 8: optional EncryptionAlgorithm encryption_algorithm
+ 9: optional binary footer_signing_key_metadata
+}
+</code></pre></div></div>
+<p>And here (<a
href="https://github.com/apache/arrow-rs/blob/02fa779a9cb122c5218293be3afb980832701683/parquet/src/file/metadata/thrift_gen.rs#L146-L158">source</a>)
is the corresponding Rust structure using the Thrift macros (before Ed wrote a
custom version in <a
href="https://github.com/apache/arrow-rs/pull/8574">#8574</a>):</p>
+<div class="language-rust highlighter-rouge"><div class="highlight"><pre
class="highlight"><code data-lang="rust"><span
class="nd">thrift_struct!</span><span class="p">(</span>
+<span class="k">struct</span> <span class="n">FileMetaData</span><span
class="o"><</span><span class="nv">'a</span><span class="o">></span>
<span class="p">{</span>
+<span class="mi">1</span><span class="p">:</span> <span
class="n">required</span> <span class="nb">i32</span> <span
class="n">version</span>
+<span class="mi">2</span><span class="p">:</span> <span
class="n">required</span> <span class="n">list</span><span
class="o"><</span><span class="nv">'a</span><span
class="o">><</span><span class="n">SchemaElement</span><span
class="o">></span> <span class="n">schema</span><span class="p">;</span>
+<span class="mi">3</span><span class="p">:</span> <span
class="n">required</span> <span class="nb">i64</span> <span
class="n">num_rows</span>
+<span class="mi">4</span><span class="p">:</span> <span
class="n">required</span> <span class="n">list</span><span
class="o"><</span><span class="nv">'a</span><span
class="o">><</span><span class="n">RowGroup</span><span
class="o">></span> <span class="n">row_groups</span>
+<span class="mi">5</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">list</span><span
class="o"><</span><span class="n">KeyValue</span><span class="o">></span>
<span class="n">key_value_metadata</span>
+<span class="mi">6</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">string</span><span
class="o"><</span><span class="nv">'a</span><span class="o">></span>
<span class="n">created_by</span>
+<span class="mi">7</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">list</span><span
class="o"><</span><span class="n">ColumnOrder</span><span
class="o">></span> <span class="n">column_orders</span><span
class="p">;</span>
+<span class="mi">8</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">EncryptionAlgorithm</span> <span
class="n">encryption_algorithm</span>
+<span class="mi">9</span><span class="p">:</span> <span
class="n">optional</span> <span class="n">binary</span><span
class="o"><</span><span class="nv">'a</span><span class="o">></span>
<span class="n">footer_signing_key_metadata</span>
+<span class="p">}</span>
+<span class="p">);</span>
+</code></pre></div></div>
+<p>This system makes it easy to see the correspondence between the Thrift
+definition and the Rust structure, and it is straightforward to support newly
added
+features such as <code>GeospatialStatistics</code>. The carefully hand-
+optimized parsers for the most performance-critical structures, such as
+<code>RowGroupMetaData</code> and <code>ColumnChunkMetaData</code>, are
harder—though still
+straightforward—to update (see <a
href="https://github.com/apache/arrow-rs/pull/8587">#8587</a>). However, those
structures are also less
+likely to change frequently.</p>
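<p>As a rough illustration of how such a declarative macro can expand an annotated, Thrift-like definition into a plain Rust struct, here is a toy sketch. This is <em>not</em> the actual <code>thrift_struct!</code> macro from the compact-thrift crate; the macro name, <code>FileInfo</code>, and its fields are hypothetical, chosen only to show the mechanics:</p>

```rust
// Toy sketch (NOT the real thrift_struct! macro): a declarative macro that
// accepts Thrift-style numbered fields and emits an ordinary Rust struct.
// The field IDs are matched but ignored here; a real implementation would
// use them to drive compact-protocol (de)serialization.
macro_rules! toy_thrift_struct {
    (struct $name:ident { $($id:literal : $fname:ident : $ftype:ty),* $(,)? }) => {
        #[derive(Debug, Default, PartialEq)]
        struct $name {
            $( $fname: $ftype, )*
        }
    };
}

// Hypothetical example definition, mirroring the style of FileMetaData above.
toy_thrift_struct!(struct FileInfo {
    1: version: i32,
    2: num_rows: i64,
});

fn main() {
    let info = FileInfo { version: 2, num_rows: 100 };
    println!("{:?}", info);
}
```

<p>The appeal of this style is that the macro input stays visually close to the <code>parquet.thrift</code> source, so keeping the two in sync is a largely mechanical edit.</p>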
+<h3>Future Improvements</h3>
+<p>With the custom parser in place, we are working on additional
improvements:</p>
+<ul>
+<li>Implementing special "skip" indexes that jump directly to the
parts of the metadata
+needed for a particular query, such as the row group offsets.</li>
+<li>Selectively decoding only the statistics for columns that are needed for a
particular query.</li>
+<li>Potentially contributing the macros back to the thrift crate.</li>
+</ul>
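<p>To give a flavor of the low-level primitives such skipping builds on, here is a minimal, self-contained sketch of the two integer encodings at the heart of the Thrift compact protocol: ULEB128 varints and zigzag-encoded signed integers. This is illustrative code written for this post, not the actual implementation in the <code>parquet</code> crate:</p>

```rust
// Minimal sketch of Thrift compact-protocol integer primitives.
// Illustrative only; not the parquet crate's actual parser.

/// Read an unsigned LEB128 varint from `buf`, returning (value, bytes_read),
/// or None if the buffer is truncated or the varint is overlong.
fn read_varint(buf: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &b) in buf.iter().enumerate().take(10) {
        value |= u64::from(b & 0x7f) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Decode a zigzag-encoded signed integer (the compact protocol's i32/i64
/// encoding, which maps small negative numbers to small varints).
fn zigzag_decode(v: u64) -> i64 {
    ((v >> 1) as i64) ^ -((v & 1) as i64)
}

fn main() {
    // 300 zigzag-encodes to 600, whose varint form is [0xd8, 0x04].
    let (raw, n) = read_varint(&[0xd8, 0x04]).expect("valid varint");
    println!("decoded {} from {} bytes", zigzag_decode(raw), n); // 300 from 2 bytes
}
```

<p>Because every compact-protocol field begins with a small header followed by such varint-sized payloads, a parser can skip an unwanted field by decoding only its length, which is what makes selective decoding of metadata cheap.</p>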
+<h3>Conclusion</h3>
+<p>We believe metadata parsing in many open source Parquet
+readers is slow primarily because they use parsers automatically generated by
Thrift
+compilers, which are not optimized for Parquet metadata parsing. By writing a
+custom parser, we significantly sped up metadata parsing in the
+<a href="https://crates.io/crates/parquet">parquet</a> Rust crate, which is
widely used in the <a href="https://arrow.apache.org/">Apache Arrow</a>
ecosystem.</p>
+<p>While this is not the first open source custom Thrift parser for Parquet
+metadata (<a
href="https://github.com/rapidsai/cudf/blob/branch-25.12/cpp/src/io/parquet/compact_protocol_reader.hpp">CUDF
has had one</a> for many years), we hope that our results will
+encourage additional Parquet implementations to consider similar optimizations.
+The approach and optimizations we describe in this post are likely applicable
to
+Parquet implementations in other languages, such as C++ and Java.</p>
+<p>Previously, efforts like this were only possible at well-financed commercial
+enterprises. On behalf of the arrow-rs and Parquet contributors, we are excited
+to share this technology with the community in the upcoming <a
href="https://crates.io/crates/parquet/57.0.0">57.0.0</a> release and
+invite you to <a
href="https://github.com/apache/arrow-rs/blob/main/CONTRIBUTING.md">come join
us</a> and help make it even
better!</p>]]></content><author><name>alamb</name></author><category
term="release" /><summary type="html"><![CDATA[Editor’s Note: While Apache
Arrow and Apache Parquet are separate projects, the Arrow arrow-rs repository
hosts the development of the parquet Rust crate, a widely used and
high-performance Parquet implementation. Summary Version 57.0.0 of the parquet
[...]
-->
<p>The Apache Arrow team is pleased to announce the version 20 release of
@@ -875,63 +1138,4 @@ This minor release covers 21 commits from 8 distinct
contributors.</p>
<li>@ashishnegi made their first contribution in <a
href="https://github.com/apache/arrow-go/pull/366">#366</a></li>
<li>@mateuszrzeszutek made their first contribution in <a
href="https://github.com/apache/arrow-go/pull/361">#361</a></li>
</ul>
-<p><strong>Full Changelog</strong>: <a
href="https://github.com/apache/arrow-go/compare/v18.2.0...v18.3.0">https://github.com/apache/arrow-go/compare/v18.2.0...v18.3.0</a></p>]]></content><author><name>pmc</name></author><category
term="release" /><summary type="html"><![CDATA[The Apache Arrow team is
pleased to announce the v18.3.0 release of Apache Arrow Go. This minor release
covers 21 commits from 8 distinct contributors. Contributors $ git shortlog -sn
v18.2.0..v18.3.0 13 Matt Topol [...]
-
--->
-<p>The Apache Arrow team is pleased to announce the version 18 release of
-the Apache Arrow ADBC libraries. This release includes <a
href="https://github.com/apache/arrow-adbc/milestone/22"><strong>28
-resolved issues</strong></a> from <a href="#contributors"><strong>22 distinct
contributors</strong></a>.</p>
-<p>This is a release of the <strong>libraries</strong>, which are at version
18. The
-<a
href="https://arrow.apache.org/adbc/18/format/specification.html"><strong>API
specification</strong></a> is versioned separately and is at
-version 1.1.0.</p>
-<p>The subcomponents are versioned independently:</p>
-<ul>
-<li>C/C++/GLib/Go/Python/Ruby: 1.6.0</li>
-<li>C#: 0.18.0</li>
-<li>Java: 0.18.0</li>
-<li>R: 0.18.0</li>
-<li>Rust: 0.18.0</li>
-</ul>
-<p>The release notes below are not exhaustive and only expose selected
-highlights of the release. Many other bugfixes and improvements have
-been made: we refer you to the <a
href="https://github.com/apache/arrow-adbc/blob/apache-arrow-adbc-18/CHANGELOG.md">complete
changelog</a>.</p>
-<h2>Release Highlights</h2>
-<p>Using Meson to build the project has been improved (#2735, #2746).</p>
-<p>The C# bindings and its drivers have seen a lot of activity in this
release. A Databricks Spark driver is now available (#2672, #2737, #2743,
#2692), with support for features like CloudFetch (#2634, #2678, #2691). The
general Spark driver now has better retry behavior for 503 responses (#2664),
supports LZ4 compression applied outside of the Arrow IPC format (#2669), and
supports OAuth (#2579), among other improvements. The "Apache"
driver for various Thrift-based system [...]
-<p>The Flight SQL driver supports OAuth (#2651).</p>
-<p>The Java bindings experimentally support a JNI wrapper around drivers
exposing the ADBC C API (#2401). These are not currently distributed via Maven
and must be built by hand.</p>
-<p>The Go bindings now support union types in the <code>database/sql</code>
wrapper (#2637). The Golang-based BigQuery driver returns more metadata about
tables (#2697).</p>
-<p>The PostgreSQL driver now avoids spurious commit/rollback commands (#2685).
It also handles improper usage more gracefully (#2653).</p>
-<p>The Python bindings now make it easier to pass options in various places
(#2589, #2700). Also, the DB-API layer can be minimally used without PyArrow
installed, making it easier for users of libraries like polars that don't need
or want a second Arrow implementation (#2609).</p>
-<p>The Rust bindings now avoid locking the driver on every operation, allowing
concurrent usage (#2736).</p>
-<h2>Contributors</h2>
-<div class="highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>$ git shortlog --perl-regexp
--author='^((?!dependabot\[bot\]).*)$' -sn
apache-arrow-adbc-17..apache-arrow-adbc-18
- 20 David Li
- 6 William Ayd
- 5 Curt Hagenlocher
- 5 davidhcoe
- 4 Alex Guo
- 4 Felipe Oliveira Carvalho
- 4 Jade Wang
- 4 Matthijs Brobbel
- 4 Sutou Kouhei
- 4 eric-wang-1990
- 3 Bruce Irschick
- 2 Milos Gligoric
- 2 Sudhir Reddy Emmadi
- 2 Todd Meng
- 1 Bryce Mecum
- 1 Dewey Dunnington
- 1 Filip Wojciechowski
- 1 Hiroaki Yutani
- 1 Hélder Gregório
- 1 Marin Nozhchev
- 1 amangoyal
- 1 qifanzhang-ms
-</code></pre></div></div>
-<h2>Roadmap</h2>
-<p>There is some discussion on a potential second revision of ADBC to include
more missing functionality and asynchronous API support. For more, see the <a
href="https://github.com/apache/arrow-adbc/milestone/8">milestone</a>. We
would welcome suggestions on APIs that could be added or extended. Some of the
contributors are planning to begin work on a proposal in the near future.</p>
-<h2>Getting Involved</h2>
-<p>We welcome questions and contributions from all interested. Issues
-can be filed on <a
href="https://github.com/apache/arrow-adbc/issues">GitHub</a>, and questions
can be directed to GitHub
-or the <a href="/community/">Arrow mailing
lists</a>.</p>]]></content><author><name>pmc</name></author><category
term="release" /><summary type="html"><![CDATA[The Apache Arrow team is pleased
to announce the version 18 release of the Apache Arrow ADBC libraries. This
release includes 28 resolved issues from 22 distinct contributors. This is a
release of the libraries, which are at version 18. The API specification is
versioned separately and is at version 1.1.0. The subcomponents are ve [...]
\ No newline at end of file
+<p><strong>Full Changelog</strong>: <a
href="https://github.com/apache/arrow-go/compare/v18.2.0...v18.3.0">https://github.com/apache/arrow-go/compare/v18.2.0...v18.3.0</a></p>]]></content><author><name>pmc</name></author><category
term="release" /><summary type="html"><![CDATA[The Apache Arrow team is
pleased to announce the v18.3.0 release of Apache Arrow Go. This minor release
covers 21 commits from 8 distinct contributors. Contributors $ git shortlog -sn
v18.2.0..v18.3.0 13 Matt Topol [...]
\ No newline at end of file
diff --git a/img/rust-parquet-metadata/flow.png
b/img/rust-parquet-metadata/flow.png
new file mode 100644
index 00000000000..1c77d9e0e6a
Binary files /dev/null and b/img/rust-parquet-metadata/flow.png differ
diff --git a/img/rust-parquet-metadata/new-pipeline.png
b/img/rust-parquet-metadata/new-pipeline.png
new file mode 100644
index 00000000000..acd0ef34988
Binary files /dev/null and b/img/rust-parquet-metadata/new-pipeline.png differ
diff --git a/img/rust-parquet-metadata/original-pipeline.png
b/img/rust-parquet-metadata/original-pipeline.png
new file mode 100644
index 00000000000..7e849d4620c
Binary files /dev/null and b/img/rust-parquet-metadata/original-pipeline.png
differ
diff --git a/img/rust-parquet-metadata/parquet.png
b/img/rust-parquet-metadata/parquet.png
new file mode 100644
index 00000000000..3dc3438a7cc
Binary files /dev/null and b/img/rust-parquet-metadata/parquet.png differ
diff --git a/img/rust-parquet-metadata/results.png
b/img/rust-parquet-metadata/results.png
new file mode 100644
index 00000000000..8ceb83fc25a
Binary files /dev/null and b/img/rust-parquet-metadata/results.png differ
diff --git a/img/rust-parquet-metadata/scaling.png
b/img/rust-parquet-metadata/scaling.png
new file mode 100644
index 00000000000..1074006f9e2
Binary files /dev/null and b/img/rust-parquet-metadata/scaling.png differ
diff --git a/img/rust-parquet-metadata/thrift-compact-encoding.png
b/img/rust-parquet-metadata/thrift-compact-encoding.png
new file mode 100644
index 00000000000..0b8014b1872
Binary files /dev/null and
b/img/rust-parquet-metadata/thrift-compact-encoding.png differ
diff --git a/img/rust-parquet-metadata/thrift-parsing-allocations.png
b/img/rust-parquet-metadata/thrift-parsing-allocations.png
new file mode 100644
index 00000000000..57b34d5b9e5
Binary files /dev/null and
b/img/rust-parquet-metadata/thrift-parsing-allocations.png differ