This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 279e98c Commit build products
279e98c is described below
commit 279e98c1cab32f2459c5baa96b1c9f4adc017905
Author: Build Pelican (action) <[email protected]>
AuthorDate: Tue Mar 25 15:47:23 2025 +0000
Commit build products
---
output/2025/03/21/parquet-pushdown/index.html | 311 +++++++++++++++++++++
output/author/xiangpeng-hao.html | 57 ++++
output/category/blog.html | 57 ++++
output/feed.xml | 40 ++-
output/feeds/all-en.atom.xml | 290 ++++++++++++++++++-
output/feeds/blog.atom.xml | 290 ++++++++++++++++++-
output/feeds/xiangpeng-hao.atom.xml | 290 ++++++++++++++++++-
output/feeds/xiangpeng-hao.rss.xml | 40 ++-
output/images/parquet-pushdown/baseline-impl.jpg | Bin 0 -> 166560 bytes
output/images/parquet-pushdown/cached-pages.jpg | Bin 0 -> 37963 bytes
output/images/parquet-pushdown/new-pipeline.jpg | Bin 0 -> 106800 bytes
output/images/parquet-pushdown/parquet-viewer.jpg | Bin 0 -> 344238 bytes
.../parquet-pushdown/pushdown-vs-no-pushdown.jpg | Bin 0 -> 194575 bytes
output/index.html | 57 ++++
14 files changed, 1427 insertions(+), 5 deletions(-)
diff --git a/output/2025/03/21/parquet-pushdown/index.html
b/output/2025/03/21/parquet-pushdown/index.html
new file mode 100644
index 0000000..49c5d9b
--- /dev/null
+++ b/output/2025/03/21/parquet-pushdown/index.html
@@ -0,0 +1,311 @@
+<!doctype html>
+<html class="no-js" lang="en" dir="ltr">
+ <head>
+ <meta charset="utf-8">
+ <meta http-equiv="x-ua-compatible" content="ie=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>Efficient Filter Pushdown in Parquet - Apache DataFusion
Blog</title>
+<link href="/blog/css/bootstrap.min.css" rel="stylesheet">
+<link href="/blog/css/fontawesome.all.min.css" rel="stylesheet">
+<link href="/blog/css/headerlink.css" rel="stylesheet">
+<link href="/blog/highlight/default.min.css" rel="stylesheet">
+<script src="/blog/highlight/highlight.js"></script>
+<script>hljs.highlightAll();</script> </head>
+ <body class="d-flex flex-column h-100">
+ <main class="flex-shrink-0">
+<!-- nav bar -->
+<nav class="navbar navbar-expand-lg navbar-dark bg-dark" aria-label="Fifth
navbar example">
+ <div class="container-fluid">
+ <a class="navbar-brand" href="/blog"><img
src="/blog/images/logo_original4x.png" style="height: 32px;"/> Apache
DataFusion Blog</a>
+ <button class="navbar-toggler" type="button" data-bs-toggle="collapse"
data-bs-target="#navbarADP" aria-controls="navbarADP" aria-expanded="false"
aria-label="Toggle navigation">
+ <span class="navbar-toggler-icon"></span>
+ </button>
+
+ <div class="collapse navbar-collapse" id="navbarADP">
+ <ul class="navbar-nav me-auto mb-2 mb-lg-0">
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/about.html">About</a>
+ </li>
+ <li class="nav-item">
+ <a class="nav-link" href="/blog/feed.xml">RSS</a>
+ </li>
+ </ul>
+ </div>
+ </div>
+</nav>
+
+
+<!-- page contents -->
+<div id="contents">
+ <div class="bg-white p-5 rounded">
+ <div class="col-sm-8 mx-auto">
+ <h1>
+ Efficient Filter Pushdown in Parquet
+ </h1>
+ <p>Posted on: Fri 21 March 2025 by Xiangpeng Hao</p>
+ <style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng Hao's
blog</a>. Thanks to <a href="https://www.influxdata.com/">InfluxData</a> for
sponsoring this work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post</a>, we discussed how <a href="https://datafusion.apache.org/">Apache
DataFusion</a> prunes <a href="https://parquet.apache.org/">Apache Parquet</a>
files to skip irrelevant <strong>files/row_groups</strong> (sometimes also <a
href="https://parquet.apache.org/docs/file-format/pageindex/">pages</a>).</p>
+<p>This post discusses how Parquet readers skip irrelevant
<strong>rows</strong> while scanning data,
+leveraging Parquet's columnar layout by first reading only filter columns,
+and then selectively reading other columns only for matching rows.</p>
+<h2>Why filter pushdown in Parquet?</h2>
+<p>Below is an example query that reads sensor data with filters on
<code>date_time</code> and <code>location</code>.
+Without filter pushdown, all rows from <code>location</code>,
<code>val</code>, and <code>date_time</code> columns are decoded before
<code>location='office'</code> is evaluated. Filter pushdown is especially
useful when the filter is selective, i.e., removes many rows.</p>
+<div class="codehilite"><pre><span></span><code><span class="k">SELECT</span>
<span class="n">val</span><span class="p">,</span> <span
class="k">location</span>
+<span class="k">FROM</span> <span class="n">sensor_data</span>
+<span class="k">WHERE</span> <span class="n">date_time</span> <span
class="o">></span> <span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span> <span class="o">=</span>
<span class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<figure>
+<img alt="Parquet pruning skips irrelevant files/row_groups, while filter
pushdown skips irrelevant rows. Without filter pushdown, all rows from
location, val, and date_time columns are decoded before `location='office'` is
evaluated. Filter pushdown is especially useful when the filter is selective,
i.e., removes many rows." class="img-responsive"
src="/blog/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg" width="80%"/>
+<figcaption>
+ Parquet pruning skips irrelevant files/row_groups, while filter pushdown
skips irrelevant rows. Without filter pushdown, all rows from location, val,
and date_time columns are decoded before `location='office'` is evaluated.
Filter pushdown is especially useful when the filter is selective, i.e.,
removes many rows.
+ </figcaption>
+</figure>
+<p>In our setup, sensor data is aggregated by date — each day has its
own Parquet file.
+At planning time, DataFusion prunes the unneeded Parquet files, i.e.,
<code>2025-03-10.parquet</code> and <code>2025-03-11.parquet</code>.</p>
+<p>Once the files to read are located, <a
href="https://github.com/apache/datafusion/issues/3463"><em>DataFusion's
current default implementation</em></a> reads all the projected columns
(<code>sensor_id</code>, <code>val</code>, and <code>location</code>) into
Arrow RecordBatches, then applies the filters over <code>location</code> to get
the final set of rows.</p>
+<p>A better approach is called <strong>filter pushdown</strong> with
<strong>late materialization</strong>, which evaluates filter conditions first
and only decodes data that passes these conditions.
+In practice, this works by first processing only the filter columns
(<code>date_time</code> and <code>location</code>), building a boolean mask of
rows that satisfy our conditions, then using this mask to selectively decode
only the relevant rows from other columns (<code>sensor_id</code>,
<code>val</code>).
+This eliminates the waste of decoding rows that will be immediately filtered
out.</p>
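<p>As a sketch of this idea (using plain Rust slices and hypothetical helper names, not the actual Arrow/Parquet APIs), the reader first builds a boolean mask from the filter columns, then decodes only the selected rows of the remaining columns:</p>

```rust
// Hypothetical sketch of late materialization; plain slices stand in
// for decoded Parquet columns.

// Build a boolean mask from the filter columns only.
// ISO-8601 date strings compare correctly as plain strings.
fn build_mask(date_time: &[&str], location: &[&str]) -> Vec<bool> {
    date_time
        .iter()
        .zip(location)
        .map(|(dt, loc)| *dt > "2025-03-11" && *loc == "office")
        .collect()
}

// "Decode" only the rows of `column` selected by `mask`.
fn decode_selected<'a>(column: &[&'a str], mask: &[bool]) -> Vec<&'a str> {
    column
        .iter()
        .zip(mask)
        .filter_map(|(v, keep)| keep.then_some(*v))
        .collect()
}

fn main() {
    let date_time = ["2025-03-10", "2025-03-12", "2025-03-13"];
    let location = ["office", "office", "home"];
    let val = ["1.0", "2.0", "3.0"];

    let mask = build_mask(&date_time, &location);
    // Only rows passing the filter are decoded for `val`.
    let vals = decode_selected(&val, &mask);
    assert_eq!(vals, ["2.0"]);
}
```

<p>In the real reader the "decode" step operates on compressed Parquet pages rather than in-memory slices, but the control flow is the same.</p>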
+<p>While simple in theory, practical implementations often make performance
worse.</p>
+<h2>How can filter pushdown be slower?</h2>
+<p>At a high level, the Parquet reader first builds a filter mask --
essentially a boolean array indicating which rows meet the filter criteria --
and then uses this mask to selectively decode only the needed rows from the
remaining columns in the projection.</p>
+<p>Let's dig into details of <a
href="https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712">how
filter pushdown is implemented</a> in the current Rust Parquet reader
implementation, illustrated in the following figure.</p>
+<figure>
+<img alt="Implementation of filter pushdown in Rust Parquet readers"
class="img-responsive" src="/blog/images/parquet-pushdown/baseline-impl.jpg"
width="70%"/>
+<figcaption>
+ Implementation of filter pushdown in Rust Parquet readers -- the first
phase builds the filter mask, the second phase applies the filter mask to the
other columns
+ </figcaption>
+</figure>
+<p>The filter pushdown has two phases:</p>
+<ol>
+<li>
+<p>Build the filter mask (steps 1-3)</p>
+</li>
+<li>
+<p>Use the filter mask to selectively decode other columns (steps 4-7); e.g.,
the output of step 3 is used as input for steps 5 and 7.</p>
+</li>
+</ol>
+<p>Within each phase, the reader takes three steps to go from Parquet to
Arrow:</p>
+<ol>
+<li>
+<p>Decompress the Parquet pages using generic decompression algorithms like
LZ4, Zstd, etc. (steps 1, 4, 6)</p>
+</li>
+<li>
+<p>Decode the page content into Arrow format (steps 2, 5, 7)</p>
+</li>
+<li>
+<p>Evaluate the filter over Arrow data (step 3)</p>
+</li>
+</ol>
+<p>In the figure above, we can see that <code>location</code> is
<strong>decompressed and decoded twice</strong>, first when building the filter
mask (steps 1, 2), and second when building the output (steps 4, 5).
+This happens for all columns that appear both in the filter and output.</p>
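<p>A toy model of the two-phase baseline (the column names follow the example query; the counting logic is purely illustrative) makes the duplicated work explicit:</p>

```rust
use std::collections::HashMap;

// Count how many times each column chunk is decompressed+decoded
// under the two-phase baseline.
fn baseline_decode_counts<'a>(
    filter_cols: &[&'a str],
    output_cols: &[&'a str],
) -> HashMap<&'a str, u32> {
    let mut counts = HashMap::new();
    // Phase 1: decode every filter column to build the mask.
    for c in filter_cols {
        *counts.entry(*c).or_insert(0) += 1;
    }
    // Phase 2: decode every output column to build the result.
    for c in output_cols {
        *counts.entry(*c).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = baseline_decode_counts(
        &["date_time", "location"], // WHERE date_time > .. AND location = ..
        &["val", "location"],       // SELECT val, location
    );
    // `location` appears in both the filter and the output,
    // so it is decompressed/decoded twice.
    assert_eq!(counts["location"], 2);
    assert_eq!(counts["val"], 1);
}
```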
+<p>The table below shows the corresponding CPU time on the <a
href="https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql#L23">ClickBench
query 22</a>:</p>
+<div class="codehilite"><pre><span></span><code><span
class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| Decompress | Decode | Apply filter| Others |</span>
+<span class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| 206 ms | 117 ms | 22 ms | 48 ms |</span>
+<span class="err">+------------+--------+-------------+--------+</span>
+</code></pre></div>
+<p>Clearly, decompress/decode operations dominate the time spent. With filter
pushdown, the reader decompresses/decodes the shared columns twice; without
filter pushdown, it does so only once.
+This explains why filter pushdown is slower in some cases.</p>
+<blockquote>
+<p><strong>Note:</strong> Highly selective filters may skip entire pages; but
as long as the reader needs even one row from a page, it must decompress (and
often decode) the entire page.</p>
+</blockquote>
+<h2>Attempt: cache filter columns</h2>
+<p>Intuitively, caching the filter columns and reusing them later could
help.</p>
+<p>But naively caching decoded pages consumes prohibitively high memory:</p>
+<ol>
+<li>
+<p>It needs to cache Arrow arrays, which are on average <a
href="https://github.com/XiangpengHao/liquid-cache/blob/main/dev/doc/liquid-cache-vldb.pdf">4x
larger than Parquet data</a>.</p>
+</li>
+<li>
+<p>It needs to cache the <strong>entire column chunk in memory</strong>,
because Phase 1 builds filters over the whole column chunk, and the cached data
is only consumed in Phase 2.</p>
+</li>
+<li>
+<p>The memory usage is proportional to the number of filter columns, which is
unbounded.</p>
+</li>
+</ol>
+<p>Worse, caching filter columns means the reader must read partly from
Parquet and partly from the cache, which is complex to implement and would
likely require a substantial change to the current implementation.</p>
+<blockquote>
+<p><strong>Feel the complexity:</strong> consider building a cache that
properly handles nested columns, multiple filters, and filters with multiple
columns.</p>
+</blockquote>
+<h2>Real solution</h2>
+<p>We need a solution that:</p>
+<ol>
+<li>
+<p>Is simple to implement, i.e., doesn't require thousands of lines of
code.</p>
+</li>
+<li>
+<p>Incurs minimal memory overhead.</p>
+</li>
+</ol>
+<p>This section describes my <a
href="https://github.com/apache/arrow-rs/pull/6921#issuecomment-2718792433">&lt;700
LOC PR (with lots of comments and tests)</a> that <strong>reduces total
ClickBench time by 15%, with up to 2x lower latency for some queries, no
obvious regression on other queries, and caches at most 2 pages (~2MB) per
column in memory</strong>.</p>
+<figure>
+<img alt="New decoding pipeline: building the filter mask and the output
columns is interleaved in a single pass, allowing the reader to cache a minimal
number of pages for a minimal amount of time" class="img-responsive"
src="/blog/images/parquet-pushdown/new-pipeline.jpg" width="80%"/>
+<figcaption>
+    New decoding pipeline: building the filter mask and the output columns is
interleaved in a single pass, allowing the reader to cache a minimal number of
pages for a minimal amount of time
+ </figcaption>
+</figure>
+<p>The new pipeline interleaves the previous two phases into a single pass, so
that:</p>
+<ol>
+<li>
+<p>The page being decompressed is immediately used to build filter masks and
output columns.</p>
+</li>
+<li>
+<p>Decompressed pages are cached for minimal time; after one pass (steps 1-6),
the cache memory is released for the next pass. </p>
+</li>
+</ol>
+<p>This allows the cache to only hold 1 page at a time, and to immediately
discard the previous page after it's used, significantly reducing the memory
requirement for caching.</p>
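<p>The interleaving can be sketched as a single loop over pages (a simplified model with made-up types, not the arrow-rs implementation):</p>

```rust
// One decoded page of a column; the filter `> 10` is arbitrary.
struct Page {
    rows: Vec<i64>,
}

// Single-pass pipeline: each page is decoded once, used to build both
// the filter mask and the output, then dropped before the next page.
fn single_pass(pages: Vec<Page>) -> Vec<i64> {
    let mut out = Vec::new();
    for page in pages {
        // Build the filter mask from the freshly decoded page.
        let mask: Vec<bool> = page.rows.iter().map(|v| *v > 10).collect();
        // Immediately reuse the same decoded page for the output.
        out.extend(
            page.rows
                .iter()
                .zip(&mask)
                .filter(|(_, m)| **m)
                .map(|(v, _)| *v),
        );
        // `page` is dropped here: the cache never holds more than one
        // decoded data page at a time.
    }
    out
}

fn main() {
    let pages = vec![Page { rows: vec![5, 20] }, Page { rows: vec![30, 1] }];
    assert_eq!(single_pass(pages), vec![20, 30]);
}
```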
+<h3>What pages are cached?</h3>
+<p>You may have noticed that only <code>location</code> is cached, not
<code>val</code>, because <code>val</code> is only used for output.
+More generally, only columns that appear both in the filter and output are
cached, and at most 1 page is cached for each such column.</p>
+<p>More examples:</p>
+<div class="codehilite"><pre><span></span><code><span class="k">SELECT</span>
<span class="n">val</span>
+<span class="k">FROM</span> <span class="n">sensor_data</span>
+<span class="k">WHERE</span> <span class="n">date_time</span> <span
class="o">></span> <span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span> <span class="o">=</span>
<span class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, no columns are cached, because <code>val</code> is not used
for filtering.</p>
+<div class="codehilite"><pre><span></span><code><span class="k">SELECT</span>
<span class="k">COUNT</span><span class="p">(</span><span
class="o">*</span><span class="p">)</span>
+<span class="k">FROM</span> <span class="n">sensor_data</span>
+<span class="k">WHERE</span> <span class="n">date_time</span> <span
class="o">></span> <span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span> <span class="o">=</span>
<span class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, again, no columns are cached, because the output projection
is empty after query plan optimization.</p>
+<h3>Then why cache 2 pages per column instead of 1?</h3>
+<p>This is another real-world nuance of how Parquet lays out pages.</p>
+<p>Parquet by default encodes data using <a
href="https://parquet.apache.org/docs/file-format/data-pages/encodings/">dictionary
encoding</a>, which writes a dictionary page as the first page of a column
chunk, followed by the keys referencing the dictionary.</p>
+<p>You can see this in action using <a
href="https://parquet-viewer.xiangpeng.systems">parquet-viewer</a>:</p>
+<figure>
+<img alt="Parquet viewer shows the page layout of a column chunk"
class="img-responsive" src="/blog/images/parquet-pushdown/parquet-viewer.jpg"
width="80%"/>
+<figcaption>
+ Parquet viewer shows the page layout of a column chunk
+ </figcaption>
+</figure>
+<p>This means that decoding a data page actually references two pages:
the dictionary page and the data page.</p>
+<p>This is why the reader caches 2 pages per column: one dictionary page and
one data page.
+The data page slot moves forward as the reader progresses; the dictionary
page slot always references the first page.</p>
+<figure>
+<img alt="Cached two pages, one for dictionary (pinned), one for data (moves
as it reads the data)" class="img-responsive"
src="/blog/images/parquet-pushdown/cached-pages.jpg" width="80%"/>
+<figcaption>
+ Cached two pages, one for dictionary (pinned), one for data (moves as it
reads the data)
+ </figcaption>
+</figure>
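<p>Conceptually, the per-column cache is a two-slot structure (an illustrative model, not the arrow-rs internals): the dictionary slot is written once and pinned, while the data slot is overwritten as the reader advances:</p>

```rust
// Illustrative two-slot page cache; pages are modeled as raw bytes.
struct PageCache {
    dictionary: Option<Vec<u8>>, // pinned: always the first page
    data: Option<Vec<u8>>,       // moves forward as rows are read
}

impl PageCache {
    fn new() -> Self {
        PageCache { dictionary: None, data: None }
    }

    fn insert(&mut self, is_dictionary: bool, page: Vec<u8>) {
        if is_dictionary {
            // Only set once: the dictionary page stays pinned.
            self.dictionary.get_or_insert(page);
        } else {
            // Evict the previous data page; at most one is kept.
            self.data = Some(page);
        }
    }
}

fn main() {
    let mut cache = PageCache::new();
    cache.insert(true, vec![0]);  // dictionary page
    cache.insert(false, vec![1]); // data page 1
    cache.insert(false, vec![2]); // data page 2 evicts page 1
    // At most 2 pages live in the cache at any time.
    assert_eq!(cache.dictionary, Some(vec![0]));
    assert_eq!(cache.data, Some(vec![2]));
}
```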
+<h2>How does it perform?</h2>
+<p>Here are my results on <a
href="https://github.com/apache/datafusion/tree/main/benchmarks#clickbench">ClickBench</a>
on my AMD 9900X machine. The total time is reduced by 15%, with Q23 being
2.24x faster;
+the handful of queries that get slower are likely noise.</p>
+<div class="codehilite"><pre><span></span><code><span
class="err">┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓</span>
+<span class="err">┃ Query ┃ no-pushdown ┃ new-pushdown ┃ Change
┃</span>
+<span
class="err">┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩</span>
+<span class="err">│ QQuery 0 │ 0.47ms │ 0.43ms
│ +1.10x faster │</span>
+<span class="err">│ QQuery 1 │ 51.10ms │ 50.10ms
│ no change │</span>
+<span class="err">│ QQuery 2 │ 68.23ms │ 64.49ms
│ +1.06x faster │</span>
+<span class="err">│ QQuery 3 │ 90.68ms │ 86.73ms
│ no change │</span>
+<span class="err">│ QQuery 4 │ 458.93ms │ 458.59ms
│ no change │</span>
+<span class="err">│ QQuery 5 │ 522.06ms │ 478.50ms
│ +1.09x faster │</span>
+<span class="err">│ QQuery 6 │ 49.84ms │ 49.94ms
│ no change │</span>
+<span class="err">│ QQuery 7 │ 55.09ms │ 55.77ms
│ no change │</span>
+<span class="err">│ QQuery 8 │ 565.26ms │ 556.95ms
│ no change │</span>
+<span class="err">│ QQuery 9 │ 575.83ms │ 575.05ms
│ no change │</span>
+<span class="err">│ QQuery 10 │ 164.56ms │ 178.23ms
│ 1.08x slower │</span>
+<span class="err">│ QQuery 11 │ 177.20ms │ 191.32ms
│ 1.08x slower │</span>
+<span class="err">│ QQuery 12 │ 591.05ms │ 569.92ms
│ no change │</span>
+<span class="err">│ QQuery 13 │ 861.06ms │ 848.59ms
│ no change │</span>
+<span class="err">│ QQuery 14 │ 596.20ms │ 580.73ms
│ no change │</span>
+<span class="err">│ QQuery 15 │ 554.96ms │ 548.77ms
│ no change │</span>
+<span class="err">│ QQuery 16 │ 1175.08ms │ 1146.07ms
│ no change │</span>
+<span class="err">│ QQuery 17 │ 1150.45ms │ 1121.49ms
│ no change │</span>
+<span class="err">│ QQuery 18 │ 2634.75ms │ 2494.07ms
│ +1.06x faster │</span>
+<span class="err">│ QQuery 19 │ 90.15ms │ 89.24ms
│ no change │</span>
+<span class="err">│ QQuery 20 │ 620.15ms │ 591.67ms
│ no change │</span>
+<span class="err">│ QQuery 21 │ 782.38ms │ 703.15ms
│ +1.11x faster │</span>
+<span class="err">│ QQuery 22 │ 1927.94ms │ 1404.35ms
│ +1.37x faster │</span>
+<span class="err">│ QQuery 23 │ 8104.11ms │ 3610.76ms
│ +2.24x faster │</span>
+<span class="err">│ QQuery 24 │ 360.79ms │ 330.55ms
│ +1.09x faster │</span>
+<span class="err">│ QQuery 25 │ 290.61ms │ 252.54ms
│ +1.15x faster │</span>
+<span class="err">│ QQuery 26 │ 395.18ms │ 362.72ms
│ +1.09x faster │</span>
+<span class="err">│ QQuery 27 │ 891.76ms │ 959.39ms
│ 1.08x slower │</span>
+<span class="err">│ QQuery 28 │ 4059.54ms │ 4137.37ms
│ no change │</span>
+<span class="err">│ QQuery 29 │ 235.88ms │ 228.99ms
│ no change │</span>
+<span class="err">│ QQuery 30 │ 564.22ms │ 584.65ms
│ no change │</span>
+<span class="err">│ QQuery 31 │ 741.20ms │ 757.87ms
│ no change │</span>
+<span class="err">│ QQuery 32 │ 2652.48ms │ 2574.19ms
│ no change │</span>
+<span class="err">│ QQuery 33 │ 2373.71ms │ 2327.10ms
│ no change │</span>
+<span class="err">│ QQuery 34 │ 2391.00ms │ 2342.15ms
│ no change │</span>
+<span class="err">│ QQuery 35 │ 700.79ms │ 694.51ms
│ no change │</span>
+<span class="err">│ QQuery 36 │ 151.51ms │ 152.93ms
│ no change │</span>
+<span class="err">│ QQuery 37 │ 108.18ms │ 86.03ms
│ +1.26x faster │</span>
+<span class="err">│ QQuery 38 │ 114.64ms │ 106.22ms
│ +1.08x faster │</span>
+<span class="err">│ QQuery 39 │ 260.80ms │ 239.13ms
│ +1.09x faster │</span>
+<span class="err">│ QQuery 40 │ 60.74ms │ 73.29ms
│ 1.21x slower │</span>
+<span class="err">│ QQuery 41 │ 58.75ms │ 67.85ms
│ 1.15x slower │</span>
+<span class="err">│ QQuery 42 │ 65.49ms │ 68.11ms
│ no change │</span>
+<span
class="err">└──────────────┴─────────────┴──────────────┴───────────────┘</span>
+<span class="err">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓</span>
+<span class="err">┃ Benchmark Summary ┃ ┃</span>
+<span class="err">┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩</span>
+<span class="err">│ Total Time (no-pushdown) │ 38344.79ms
│</span>
+<span class="err">│ Total Time (new-pushdown) │ 32800.50ms
│</span>
+<span class="err">│ Average Time (no-pushdown) │ 891.74ms
│</span>
+<span class="err">│ Average Time (new-pushdown) │ 762.80ms
│</span>
+<span class="err">│ Queries Faster │ 13
│</span>
+<span class="err">│ Queries Slower │ 5
│</span>
+<span class="err">│ Queries with No Change │ 25
│</span>
+<span
class="err">└─────────────────────────────┴────────────┘</span>
+</code></pre></div>
+<h2>Conclusion</h2>
+<p>Despite being simple in theory, filter pushdown in Parquet is non-trivial
to implement.
+It requires understanding both the Parquet format and reader implementation
details.
+The challenge lies in efficiently coordinating decoding, filter evaluation,
and memory management.</p>
+<p>If you are interested in this level of optimization and want to help test,
document and implement this type of optimization, come find us in the <a
href="https://datafusion.apache.org/contributor-guide/communication.html">DataFusion
Community</a>. We would love to have you. </p>
+ </div>
+ </div>
+ </div>
+ <!-- footer -->
+ <div class="row">
+ <div class="large-12 medium-12 columns">
+ <p style="font-style: italic; font-size: 0.8rem; text-align: center;">
+ Copyright 2025, <a href="https://www.apache.org/">The Apache
Software Foundation</a>, Licensed under the <a
href="https://www.apache.org/licenses/LICENSE-2.0">Apache License, Version
2.0</a>.<br/>
+ Apache® and the Apache feather logo are trademarks of The Apache
Software Foundation.
+ </p>
+ </div>
+ </div>
+ <script src="/blog/js/bootstrap.bundle.min.js"></script> </main>
+ </body>
+</html>
diff --git a/output/author/xiangpeng-hao.html b/output/author/xiangpeng-hao.html
index f2303b8..20b471a 100644
--- a/output/author/xiangpeng-hao.html
+++ b/output/author/xiangpeng-hao.html
@@ -47,6 +47,63 @@
<p><i>Here you can find the latest updates from DataFusion and
related projects.</i></p>
+ <!-- Post -->
+ <div class="row">
+ <div class="callout">
+ <article class="post">
+ <header>
+ <div class="title">
+ <h1><a
href="/blog/2025/03/21/parquet-pushdown">Efficient Filter Pushdown in
Parquet</a></h1>
+ <p>Posted on: Fri 21 March 2025 by Xiangpeng Hao</p>
+ <p><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng Hao's
blog</a>. Thanks to <a href="https://www.influxdata.com/">InfluxData</a> for
sponsoring this work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></p>
+ <footer>
+ <ul class="actions">
+ <div style="text-align: right"><a
href="/blog/2025/03/21/parquet-pushdown" class="button medium">Continue
Reading</a></div>
+ </ul>
+ <ul class="stats">
+ </ul>
+ </footer>
+ </article>
+ </div>
+ </div>
<!-- Post -->
<div class="row">
<div class="callout">
diff --git a/output/category/blog.html b/output/category/blog.html
index 56f617b..33393cc 100644
--- a/output/category/blog.html
+++ b/output/category/blog.html
@@ -47,6 +47,63 @@
<p><i>Here you can find the latest updates from DataFusion and
related projects.</i></p>
+ <!-- Post -->
+ <div class="row">
+ <div class="callout">
+ <article class="post">
+ <header>
+ <div class="title">
+ <h1><a
href="/blog/2025/03/21/parquet-pushdown">Efficient Filter Pushdown in
Parquet</a></h1>
+ <p>Posted on: Fri 21 March 2025 by Xiangpeng Hao</p>
+ <p><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng Hao's
blog</a>. Thanks to <a href="https://www.influxdata.com/">InfluxData</a> for
sponsoring this work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></p>
+ <footer>
+ <ul class="actions">
+ <div style="text-align: right"><a
href="/blog/2025/03/21/parquet-pushdown" class="button medium">Continue
Reading</a></div>
+ </ul>
+ <ul class="stats">
+ </ul>
+ </footer>
+ </article>
+ </div>
+ </div>
<!-- Post -->
<div class="row">
<div class="callout">
diff --git a/output/feed.xml b/output/feed.xml
index fccf433..fcdbe66 100644
--- a/output/feed.xml
+++ b/output/feed.xml
@@ -1,5 +1,43 @@
<?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
20 Mar 2025 00:00:00 +0000</lastBuildDate><item><title>Apache DataFusion Comet
0.7.0
Release</title><link>https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0.7.0</link><description><!--
+<rss version="2.0"><channel><title>Apache DataFusion
Blog</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Fri,
21 Mar 2025 00:00:00 +0000</lastBuildDate><item><title>Efficient Filter
Pushdown in
Parquet</title><link>https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown</link><description><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Xiangpeng
Hao</dc:creator><pubDate>Fri, 21 Mar 2025 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2025-03-21:/blog/2025/03/21/parquet-pushdown</guid><category>blog</category></item><item><title>Apache
DataFusion Comet 0.7.0 Release</title><link>https://datafusion [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/all-en.atom.xml b/output/feeds/all-en.atom.xml
index 2dc033d..450d535 100644
--- a/output/feeds/all-en.atom.xml
+++ b/output/feeds/all-en.atom.xml
@@ -1,5 +1,293 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion
Blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-20T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
DataFusion Comet 0.7.0 Release</title><link
href="https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0.7.0" rel
[...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion
Blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/all-en.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-21T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Efficient
Filter Pushdown in Parquet</title><link
href="https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown"
rel="alter [...]
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></summary><content type="html"><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post</a>, we discussed how <a
href="https://datafusion.apache.org/">Apache DataFusion</a> prunes
<a href="https://parquet.apache.org/">Apache Parquet</a> files to
skip irrelevant <strong>files/row_groups</strong> (sometimes also
<a
href="https://parquet.apache.org/docs/file-format/pageindex/">pages</a>).</p>
+<p>This post discusses how Parquet readers skip irrelevant
<strong>rows</strong> while scanning data,
+leveraging Parquet's columnar layout by first reading only filter columns,
+and then selectively reading other columns only for matching rows.</p>
+<h2>Why filter pushdown in Parquet?</h2>
+<p>Below is an example query that reads sensor data with filters on
<code>date_time</code> and <code>location</code>.
+Without filter pushdown, all rows from <code>location</code>,
<code>val</code>, and <code>date_time</code> columns
are decoded before <code>location='office'</code> is evaluated.
Filter pushdown is especially useful when the filter is selective, i.e.,
removes many rows.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span class="n">val</span><span
class="p">,</span> <span class="k">location</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<figure>
+<img alt="Parquet pruning skips irrelevant files/row_groups, while filter
pushdown skips irrelevant rows. Without filter pushdown, all rows from
location, val, and date_time columns are decoded before `location='office'` is
evaluated. Filter pushdown is especially useful when the filter is selective,
i.e., removes many rows." class="img-responsive"
src="/blog/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg" width="80%"/>
+<figcaption>
+ Parquet pruning skips irrelevant files/row_groups, while filter pushdown
skips irrelevant rows. Without filter pushdown, all rows from location, val,
and date_time columns are decoded before `location='office'` is evaluated.
Filter pushdown is especially useful when the filter is selective, i.e.,
removes many rows.
+ </figcaption>
+</figure>
+<p>In our setup, sensor data is aggregated by date &mdash; each day
has its own Parquet file.
+At planning time, DataFusion prunes the unneeded Parquet files, i.e.,
<code>2025-03-10.parquet</code> and
<code>2025-03-11.parquet</code>.</p>
+<p>Once the files to read are located, <a
href="https://github.com/apache/datafusion/issues/3463"><em>DataFusion's
current default implementation</em></a> reads all the projected
columns (<code>sensor_id</code>, <code>val</code>, and
<code>location</code>) into Arrow RecordBatches, then applies the
filters over <code>location</code> to get the final set of
rows.</p>
+<p>A better approach is called <strong>filter
pushdown</strong> with <strong>late materialization</strong>,
which evaluates filter conditions first and only decodes data that passes these
conditions.
+In practice, this works by first processing only the filter columns
(<code>date_time</code> and <code>location</code>),
building a boolean mask of rows that satisfy our conditions, then using this
mask to selectively decode only the relevant rows from other columns
(<code>sensor_id</code>, <code>val</code>).
+This eliminates the waste of decoding rows that will be immediately filtered
out.</p>
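<p>The flow above can be sketched in a few lines of illustrative Python, with plain lists standing in for decoded Parquet columns (the function and data are made up for illustration, not the actual DataFusion API):</p>

```python
# Illustrative sketch of late materialization; columns are plain Python
# lists rather than encoded Parquet pages, and all names are made up.

def late_materialize(columns, filter_cols, output_cols, predicate):
    n_rows = len(columns[filter_cols[0]])
    # Step 1: read only the filter columns and build a boolean mask.
    mask = [
        predicate({c: columns[c][i] for c in filter_cols})
        for i in range(n_rows)
    ]
    # Step 2: decode the other columns only for rows where the mask is True.
    return {
        c: [v for v, keep in zip(columns[c], mask) if keep]
        for c in output_cols
    }

data = {
    "date_time": ["2025-03-10", "2025-03-12", "2025-03-12"],
    "location":  ["home", "office", "home"],
    "val":       [1.0, 2.0, 3.0],
}
result = late_materialize(
    data,
    filter_cols=["date_time", "location"],
    output_cols=["val", "location"],
    predicate=lambda row: row["date_time"] > "2025-03-11"
                          and row["location"] == "office",
)
# result == {"val": [2.0], "location": ["office"]}
```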
+<p>While simple in theory, practical implementations often make
performance worse.</p>
+<h2>How can filter pushdown be slower?</h2>
+<p>At a high level, the Parquet reader first builds a filter mask --
essentially a boolean array indicating which rows meet the filter criteria --
and then uses this mask to selectively decode only the needed rows from the
remaining columns in the projection.</p>
+<p>Let's dig into details of <a
href="https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712">how
filter pushdown is implemented</a> in the current Rust Parquet reader
implementation, illustrated in the following figure.</p>
+<figure>
+<img alt="Implementation of filter pushdown in Rust Parquet readers"
class="img-responsive" src="/blog/images/parquet-pushdown/baseline-impl.jpg"
width="70%"/>
+<figcaption>
+ Implementation of filter pushdown in Rust Parquet readers -- the first
phase builds the filter mask, the second phase applies the filter mask to the
other columns
+ </figcaption>
+</figure>
+<p>The filter pushdown has two phases:</p>
+<ol>
+<li>
+<p>Build the filter mask (steps 1-3)</p>
+</li>
+<li>
+<p>Use the filter mask to selectively decode other columns (steps 4-7),
e.g., the output of step 3 is used as input for steps 5 and 7.</p>
+</li>
+</ol>
+<p>Within each phase, it takes three steps from Parquet to
Arrow:</p>
+<ol>
+<li>
+<p>Decompress the Parquet pages using generic decompression algorithms
like LZ4, Zstd, etc. (steps 1, 4, 6)</p>
+</li>
+<li>
+<p>Decode the page content into Arrow format (steps 2, 5, 7)</p>
+</li>
+<li>
+<p>Evaluate the filter over Arrow data (step 3)</p>
+</li>
+</ol>
+<p>In the figure above, we can see that
<code>location</code> is <strong>decompressed and decoded
twice</strong>, first when building the filter mask (steps 1, 2), and
second when building the output (steps 4, 5).
+This happens for all columns that appear both in the filter and
output.</p>
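<p>A toy model makes the duplicated work visible; the decode counter and the page representation are illustrative simplifications, not the arrow-rs code:</p>

```python
# Toy model of the two-phase pipeline. decode_column() stands in for
# decompress (LZ4/Zstd/...) + decode into Arrow; the counter shows that
# a column used in both the filter and the output is processed twice.

decode_count = {}

def decode_column(pages, name):
    decode_count[name] = decode_count.get(name, 0) + 1
    return [row for page in pages for row in page]

def two_phase_read(pages_by_col, filter_col, predicate, output_cols):
    # Phase 1 (steps 1-3): decode the filter column, evaluate the filter.
    mask = [predicate(v)
            for v in decode_column(pages_by_col[filter_col], filter_col)]
    # Phase 2 (steps 4-7): decode every output column -- including the
    # filter column again, if it is also projected -- and apply the mask.
    return {
        c: [v for v, keep in zip(decode_column(pages_by_col[c], c), mask) if keep]
        for c in output_cols
    }

pages = {"location": [["home", "office"]], "val": [[1.0, 2.0]]}
out = two_phase_read(pages, "location", lambda v: v == "office",
                     ["val", "location"])
# decode_count == {"location": 2, "val": 1}
```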
+<p>The table below shows the corresponding CPU time on the <a
href="https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql#L23">ClickBench
query 22</a>:</p>
+<div
class="codehilite"><pre><span></span><code><span
class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| Decompress | Decode | Apply filter| Others
|</span>
+<span
class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| 206 ms | 117 ms | 22 ms | 48 ms
|</span>
+<span
class="err">+------------+--------+-------------+--------+</span>
+</code></pre></div>
+<p>Clearly, decompress/decode operations dominate the time spent. With
filter pushdown, columns in both the filter and the output are
decompressed/decoded twice; without filter pushdown, each column is handled
only once.
+This explains why filter pushdown is slower in some cases.</p>
+<blockquote>
+<p><strong>Note:</strong> Highly selective filters may skip
an entire page; but if even one row from a page is needed, the reader must
decompress and often decode the entire page.</p>
+</blockquote>
+<h2>Attempt: cache filter columns</h2>
+<p>Intuitively, caching the filter columns and reusing them later could
help.</p>
+<p>But naively caching decoded pages consumes prohibitive amounts of
memory:</p>
+<ol>
+<li>
+<p>It needs to cache Arrow arrays, which are on average <a
href="https://github.com/XiangpengHao/liquid-cache/blob/main/dev/doc/liquid-cache-vldb.pdf">4x
larger than Parquet data</a>.</p>
+</li>
+<li>
+<p>It needs to cache the <strong>entire column chunk in
memory</strong>, because in Phase 1 it builds filters over the column
chunk, and only uses it in Phase 2.</p>
+</li>
+<li>
+<p>The memory usage is proportional to the number of filter columns,
which can be arbitrarily large.</p>
+</li>
+</ol>
+<p>Worse, caching filter columns means it needs to read partially from
Parquet and partially from cache, which is complex to implement, likely
requiring a substantial change to the current implementation. </p>
+<blockquote>
+<p><strong>Feel the complexity:</strong> consider building a
cache that properly handles nested columns, multiple filters, and filters with
multiple columns.</p>
+</blockquote>
+<h2>Real solution</h2>
+<p>We need a solution that:</p>
+<ol>
+<li>
+<p>Is simple to implement, i.e., doesn't require thousands of lines of
code.</p>
+</li>
+<li>
+<p>Incurs minimal memory overhead.</p>
+</li>
+</ol>
+<p>This section describes my <a
href="https://github.com/apache/arrow-rs/pull/6921#issuecomment-2718792433">&lt;700
LOC PR (with lots of comments and tests)</a> that <strong>reduces
total ClickBench time by 15%, with up to 2x lower latency for some queries, no
obvious regression on other queries, and caches at most 2 pages (~2MB) per
column in memory</strong>.</p>
+<figure>
+<img alt="New decoding pipeline, building filter mask and output columns
are interleaved in a single pass, allowing us to cache minimal pages for
minimal amount of time" class="img-responsive"
src="/blog/images/parquet-pushdown/new-pipeline.jpg" width="80%"/>
+<figcaption>
+ New decoding pipeline, building filter mask and output columns are
interleaved in a single pass, allowing us to cache minimal pages for minimal
amount of time
+ </figcaption>
+</figure>
+<p>The new pipeline interleaves the previous two phases into a single
pass, so that:</p>
+<ol>
+<li>
+<p>The page being decompressed is immediately used to build filter masks
and output columns.</p>
+</li>
+<li>
+<p>Decompressed pages are cached for minimal time; after one pass (steps
1-6), the cache memory is released for the next pass. </p>
+</li>
+</ol>
+<p>This allows the cache to only hold 1 page at a time, and to
immediately discard the previous page after it's used, significantly reducing
the memory requirement for caching.</p>
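<p>The interleaved pipeline can be modeled the same way: each pass decodes one filter-column page, uses it for both the filter mask and the output, then drops it (again a simplified sketch with made-up names, not the arrow-rs implementation):</p>

```python
# Toy model of the interleaved pipeline: building the filter mask and
# the output columns happens in a single pass per page, so the cached
# page lives only until the pass finishes.

def interleaved_read(pages_by_col, filter_col, predicate, output_cols):
    result = {c: [] for c in output_cols}
    for i in range(len(pages_by_col[filter_col])):  # one pass per page
        cached = pages_by_col[filter_col][i]        # decoded once, cached
        mask = [predicate(v) for v in cached]       # mask for this pass
        for c in output_cols:
            # Reuse the cached page instead of decoding filter_col again.
            page = cached if c == filter_col else pages_by_col[c][i]
            result[c].extend(v for v, keep in zip(page, mask) if keep)
        cached = None                               # released before next pass
    return result

pages = {
    "location": [["home", "office"], ["office", "lab"]],
    "val":      [[1.0, 2.0], [3.0, 4.0]],
}
out = interleaved_read(pages, "location", lambda v: v == "office",
                       ["val", "location"])
# out == {"val": [2.0, 3.0], "location": ["office", "office"]}
```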
+<h3>What pages are cached?</h3>
+<p>You may have noticed that only <code>location</code> is
cached, not <code>val</code>, because <code>val</code>
is only used for output.
+More generally, only columns that appear both in the filter and output are
cached, and at most 1 page is cached for each such column.</p>
+<p>More examples:</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span class="n">val</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, no columns are cached, because
<code>val</code> is not used for filtering.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span
class="k">COUNT</span><span class="p">(</span><span
class="o">*</span><span class="p">)</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, again, no columns are cached, because the output
projection is empty after query plan optimization.</p>
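<p>The caching rule in all three examples reduces to a set intersection between filter columns and output columns; a minimal sketch (the function name is hypothetical):</p>

```python
def columns_to_cache(filter_cols, output_cols):
    # Only columns used both for filtering and for output benefit from
    # caching the decoded page; order follows the filter columns.
    out = set(output_cols)
    return [c for c in filter_cols if c in out]

# SELECT val, location ... WHERE date_time > ... AND location = ...
shared = columns_to_cache(["date_time", "location"], ["val", "location"])
# shared == ["location"]: only location is in both the filter and the output
```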
+<h3>Then why cache 2 pages per column instead of 1?</h3>
+<p>This is another real-world nuance regarding how Parquet lays out the
pages.</p>
+<p>Parquet by default encodes data using <a
href="https://parquet.apache.org/docs/file-format/data-pages/encodings/">dictionary
encoding</a>, which writes a dictionary page as the first page of a
column chunk, followed by the keys referencing the dictionary.</p>
+<p>You can see this in action using <a
href="https://parquet-viewer.xiangpeng.systems">parquet-viewer</a>:</p>
+<figure>
+<img alt="Parquet viewer shows the page layout of a column chunk"
class="img-responsive" src="/blog/images/parquet-pushdown/parquet-viewer.jpg"
width="80%"/>
+<figcaption>
+ Parquet viewer shows the page layout of a column chunk
+ </figcaption>
+</figure>
+<p>This means that decoding a page of data actually references two
pages: the dictionary page and the data page.</p>
+<p>This is why the reader caches 2 pages per column: one dictionary page and
one data page.
+The data-page slot moves forward as the data is read, while the
dictionary-page slot always references the first page.</p>
+<figure>
+<img alt="Cached two pages, one for dictionary (pinned), one for data
(moves as it reads the data)" class="img-responsive"
src="/blog/images/parquet-pushdown/cached-pages.jpg" width="80%"/>
+<figcaption>
+ Cached two pages, one for dictionary (pinned), one for data (moves as it
reads the data)
+ </figcaption>
+</figure>
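<p>A toy model of the per-column cache shows why both slots are needed: the dictionary slot stays pinned while the data-page slot advances (class and method names are made up for illustration):</p>

```python
class TwoSlotPageCache:
    """Toy per-column page cache: one pinned dictionary page, one data page."""

    def __init__(self, dictionary_page):
        self.dictionary_page = dictionary_page  # pinned for the whole chunk
        self.data_page = None                   # advances as reading proceeds

    def load_data_page(self, page):
        self.data_page = page                   # evicts the previous data page

    def decode(self, offset):
        # Data pages hold dictionary keys, so resolving one value
        # touches both cached pages.
        return self.dictionary_page[self.data_page[offset]]

cache = TwoSlotPageCache(dictionary_page=["home", "office"])
cache.load_data_page([0, 1, 1])  # dictionary-encoded keys
value = cache.decode(2)
# value == "office"
```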
+<h2>How does it perform?</h2>
+<p>Here are my results on <a
href="https://github.com/apache/datafusion/tree/main/benchmarks#clickbench">ClickBench</a>
on my AMD 9900X machine. The total time is reduced by 15%, with Q23 being
2.24x faster,
+and queries that get slower are likely due to noise.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="err">┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓</span>
+<span class="err">┃ Query ┃ no-pushdown ┃ new-pushdown ┃
Change ┃</span>
+<span
class="err">┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩</span>
+<span class="err">&boxv; QQuery 0 &boxv; 0.47ms
&boxv; 0.43ms &boxv; +1.10x faster &boxv;</span>
+<span class="err">&boxv; QQuery 1 &boxv; 51.10ms
&boxv; 50.10ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 2 &boxv; 68.23ms
&boxv; 64.49ms &boxv; +1.06x faster &boxv;</span>
+<span class="err">&boxv; QQuery 3 &boxv; 90.68ms
&boxv; 86.73ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 4 &boxv; 458.93ms
&boxv; 458.59ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 5 &boxv; 522.06ms
&boxv; 478.50ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 6 &boxv; 49.84ms
&boxv; 49.94ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 7 &boxv; 55.09ms
&boxv; 55.77ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 8 &boxv; 565.26ms
&boxv; 556.95ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 9 &boxv; 575.83ms
&boxv; 575.05ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 10 &boxv; 164.56ms
&boxv; 178.23ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 11 &boxv; 177.20ms
&boxv; 191.32ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 12 &boxv; 591.05ms
&boxv; 569.92ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 13 &boxv; 861.06ms
&boxv; 848.59ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 14 &boxv; 596.20ms
&boxv; 580.73ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 15 &boxv; 554.96ms
&boxv; 548.77ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 16 &boxv; 1175.08ms
&boxv; 1146.07ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 17 &boxv; 1150.45ms
&boxv; 1121.49ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 18 &boxv; 2634.75ms
&boxv; 2494.07ms &boxv; +1.06x faster &boxv;</span>
+<span class="err">&boxv; QQuery 19 &boxv; 90.15ms
&boxv; 89.24ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 20 &boxv; 620.15ms
&boxv; 591.67ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 21 &boxv; 782.38ms
&boxv; 703.15ms &boxv; +1.11x faster &boxv;</span>
+<span class="err">&boxv; QQuery 22 &boxv; 1927.94ms
&boxv; 1404.35ms &boxv; +1.37x faster &boxv;</span>
+<span class="err">&boxv; QQuery 23 &boxv; 8104.11ms
&boxv; 3610.76ms &boxv; +2.24x faster &boxv;</span>
+<span class="err">&boxv; QQuery 24 &boxv; 360.79ms
&boxv; 330.55ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 25 &boxv; 290.61ms
&boxv; 252.54ms &boxv; +1.15x faster &boxv;</span>
+<span class="err">&boxv; QQuery 26 &boxv; 395.18ms
&boxv; 362.72ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 27 &boxv; 891.76ms
&boxv; 959.39ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 28 &boxv; 4059.54ms
&boxv; 4137.37ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 29 &boxv; 235.88ms
&boxv; 228.99ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 30 &boxv; 564.22ms
&boxv; 584.65ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 31 &boxv; 741.20ms
&boxv; 757.87ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 32 &boxv; 2652.48ms
&boxv; 2574.19ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 33 &boxv; 2373.71ms
&boxv; 2327.10ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 34 &boxv; 2391.00ms
&boxv; 2342.15ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 35 &boxv; 700.79ms
&boxv; 694.51ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 36 &boxv; 151.51ms
&boxv; 152.93ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 37 &boxv; 108.18ms
&boxv; 86.03ms &boxv; +1.26x faster &boxv;</span>
+<span class="err">&boxv; QQuery 38 &boxv; 114.64ms
&boxv; 106.22ms &boxv; +1.08x faster &boxv;</span>
+<span class="err">&boxv; QQuery 39 &boxv; 260.80ms
&boxv; 239.13ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 40 &boxv; 60.74ms
&boxv; 73.29ms &boxv; 1.21x slower &boxv;</span>
+<span class="err">&boxv; QQuery 41 &boxv; 58.75ms
&boxv; 67.85ms &boxv; 1.15x slower &boxv;</span>
+<span class="err">&boxv; QQuery 42 &boxv; 65.49ms
&boxv; 68.11ms &boxv; no change &boxv;</span>
+<span
class="err">&boxur;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&
[...]
+<span
class="err">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓</span>
+<span class="err">┃ Benchmark Summary ┃
┃</span>
+<span
class="err">┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩</span>
+<span class="err">&boxv; Total Time (no-pushdown) &boxv;
38344.79ms &boxv;</span>
+<span class="err">&boxv; Total Time (new-pushdown) &boxv;
32800.50ms &boxv;</span>
+<span class="err">&boxv; Average Time (no-pushdown) &boxv;
891.74ms &boxv;</span>
+<span class="err">&boxv; Average Time (new-pushdown) &boxv;
762.80ms &boxv;</span>
+<span class="err">&boxv; Queries Faster &boxv;
13 &boxv;</span>
+<span class="err">&boxv; Queries Slower &boxv;
5 &boxv;</span>
+<span class="err">&boxv; Queries with No Change &boxv;
25 &boxv;</span>
+<span
class="err">&boxur;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxul;</span>
+</code></pre></div>
+<h2>Conclusion</h2>
+<p>Despite being simple in theory, filter pushdown in Parquet is
non-trivial to implement.
+It requires understanding both the Parquet format and reader implementation
details.
+The challenge lies in efficiently navigating through the dynamics of decoding,
filter evaluation, and memory management.</p>
+<p>If you are interested in this level of optimization and want to help
test, document and implement this type of optimization, come find us in the
<a
href="https://datafusion.apache.org/contributor-guide/communication.html">DataFusion
Community</a>. We would love to have you. </p></content><category
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.7.0
Release</title><link
href="https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0 [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/blog.atom.xml b/output/feeds/blog.atom.xml
index 5de58a8..dbe61aa 100644
--- a/output/feeds/blog.atom.xml
+++ b/output/feeds/blog.atom.xml
@@ -1,5 +1,293 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-20T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Apache
DataFusion Comet 0.7.0 Release</title><link
href="https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0.7.0 [...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
blog</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/blog.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-21T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Efficient
Filter Pushdown in Parquet</title><link
href="https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown" rel="
[...]
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></summary><content type="html"><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post</a>, we discussed how <a
href="https://datafusion.apache.org/">Apache DataFusion</a> prunes
<a href="https://parquet.apache.org/">Apache Parquet</a> files to
skip irrelevant <strong>files/row_groups</strong> (sometimes also
<a
href="https://parquet.apache.org/docs/file-format/pageindex/">pages</a>).</p>
+<p>This post discusses how Parquet readers skip irrelevant
<strong>rows</strong> while scanning data,
+leveraging Parquet's columnar layout by first reading only filter columns,
+and then selectively reading other columns only for matching rows.</p>
+<h2>Why filter pushdown in Parquet?</h2>
+<p>Below is an example query that reads sensor data with filters on
<code>date_time</code> and <code>location</code>.
+Without filter pushdown, all rows from <code>location</code>,
<code>val</code>, and <code>date_time</code> columns
are decoded before <code>location='office'</code> is evaluated.
Filter pushdown is especially useful when the filter is selective, i.e.,
removes many rows.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span class="n">val</span><span
class="p">,</span> <span class="k">location</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<figure>
+<img alt="Parquet pruning skips irrelevant files/row_groups, while filter
pushdown skips irrelevant rows. Without filter pushdown, all rows from
location, val, and date_time columns are decoded before `location='office'` is
evaluated. Filter pushdown is especially useful when the filter is selective,
i.e., removes many rows." class="img-responsive"
src="/blog/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg" width="80%"/>
+<figcaption>
+ Parquet pruning skips irrelevant files/row_groups, while filter pushdown
skips irrelevant rows. Without filter pushdown, all rows from location, val,
and date_time columns are decoded before `location='office'` is evaluated.
Filter pushdown is especially useful when the filter is selective, i.e.,
removes many rows.
+ </figcaption>
+</figure>
+<p>In our setup, sensor data is aggregated by date &mdash; each day
has its own Parquet file.
+At planning time, DataFusion prunes the unneeded Parquet files, i.e.,
<code>2025-03-10.parquet</code> and
<code>2025-03-11.parquet</code>.</p>
+<p>Once the files to read are located, <a
href="https://github.com/apache/datafusion/issues/3463"><em>DataFusion's
current default implementation</em></a> reads all the projected
columns (<code>sensor_id</code>, <code>val</code>, and
<code>location</code>) into Arrow RecordBatches, then applies the
filters over <code>location</code> to get the final set of
rows.</p>
+<p>A better approach is called <strong>filter
pushdown</strong> with <strong>late materialization</strong>,
which evaluates filter conditions first and only decodes data that passes these
conditions.
+In practice, this works by first processing only the filter columns
(<code>date_time</code> and <code>location</code>),
building a boolean mask of rows that satisfy our conditions, then using this
mask to selectively decode only the relevant rows from other columns
(<code>sensor_id</code>, <code>val</code>).
+This eliminates the waste of decoding rows that will be immediately filtered
out.</p>
+<p>While simple in theory, practical implementations often make
performance worse.</p>
+<h2>How can filter pushdown be slower?</h2>
+<p>At a high level, the Parquet reader first builds a filter mask --
essentially a boolean array indicating which rows meet the filter criteria --
and then uses this mask to selectively decode only the needed rows from the
remaining columns in the projection.</p>
+<p>Let's dig into details of <a
href="https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712">how
filter pushdown is implemented</a> in the current Rust Parquet reader
implementation, illustrated in the following figure.</p>
+<figure>
+<img alt="Implementation of filter pushdown in Rust Parquet readers"
class="img-responsive" src="/blog/images/parquet-pushdown/baseline-impl.jpg"
width="70%"/>
+<figcaption>
+ Implementation of filter pushdown in Rust Parquet readers -- the first
phase builds the filter mask, the second phase applies the filter mask to the
other columns
+ </figcaption>
+</figure>
+<p>The filter pushdown has two phases:</p>
+<ol>
+<li>
+<p>Build the filter mask (steps 1-3)</p>
+</li>
+<li>
+<p>Use the filter mask to selectively decode other columns (steps 4-7),
e.g., the output of step 3 is used as input for steps 5 and 7.</p>
+</li>
+</ol>
+<p>Within each phase, it takes three steps from Parquet to
Arrow:</p>
+<ol>
+<li>
+<p>Decompress the Parquet pages using generic decompression algorithms
like LZ4, Zstd, etc. (steps 1, 4, 6)</p>
+</li>
+<li>
+<p>Decode the page content into Arrow format (steps 2, 5, 7)</p>
+</li>
+<li>
+<p>Evaluate the filter over Arrow data (step 3)</p>
+</li>
+</ol>
+<p>In the figure above, we can see that
<code>location</code> is <strong>decompressed and decoded
twice</strong>, first when building the filter mask (steps 1, 2), and
second when building the output (steps 4, 5).
+This happens for all columns that appear both in the filter and
output.</p>
+<p>The table below shows the corresponding CPU time on the <a
href="https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql#L23">ClickBench
query 22</a>:</p>
+<div
class="codehilite"><pre><span></span><code><span
class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| Decompress | Decode | Apply filter| Others
|</span>
+<span
class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| 206 ms | 117 ms | 22 ms | 48 ms
|</span>
+<span
class="err">+------------+--------+-------------+--------+</span>
+</code></pre></div>
+<p>Clearly, decompress/decode operations dominate the time spent. With
filter pushdown, columns in both the filter and the output are
decompressed/decoded twice; without filter pushdown, each column is handled
only once.
+This explains why filter pushdown is slower in some cases.</p>
+<blockquote>
+<p><strong>Note:</strong> Highly selective filters may skip
entire pages; but as long as the reader needs even one row from a page, it
must decompress (and often decode) the entire page.</p>
+</blockquote>
+<h2>Attempt: cache filter columns</h2>
+<p>Intuitively, caching the filter columns and reusing them later could
help.</p>
+<p>But naively caching decoded pages consumes prohibitively high
memory:</p>
+<ol>
+<li>
+<p>It needs to cache Arrow arrays, which are on average <a
href="https://github.com/XiangpengHao/liquid-cache/blob/main/dev/doc/liquid-cache-vldb.pdf">4x
larger than Parquet data</a>.</p>
+</li>
+<li>
+<p>It needs to cache the <strong>entire column chunk in
memory</strong>, because in Phase 1 it builds filters over the column
chunk, and only uses it in Phase 2.</p>
+</li>
+<li>
+<p>The memory usage is proportional to the number of filter columns,
which can be unboundedly high. </p>
+</li>
+</ol>
+<p>Worse, caching filter columns means it needs to read partially from
Parquet and partially from cache, which is complex to implement, likely
requiring a substantial change to the current implementation. </p>
+<blockquote>
+<p><strong>Feel the complexity:</strong> consider building a
cache that properly handles nested columns, multiple filters, and filters with
multiple columns.</p>
+</blockquote>
+<h2>Real solution</h2>
+<p>We need a solution that:</p>
+<ol>
+<li>
+<p>Is simple to implement, i.e., doesn't require thousands of lines of
code.</p>
+</li>
+<li>
+<p>Incurs minimal memory overhead.</p>
+</li>
+</ol>
+<p>This section describes my <a
href="https://github.com/apache/arrow-rs/pull/6921#issuecomment-2718792433">&lt;700
LOC PR (with lots of comments and tests)</a> that <strong>reduces
total ClickBench time by 15%, with up to 2x lower latency for some queries, no
obvious regression on other queries, and caches at most 2 pages (~2MB) per
column in memory</strong>.</p>
+<figure>
+<img alt="New decoding pipeline, building filter mask and output columns
are interleaved in a single pass, allowing us to cache minimal pages for
minimal amount of time" class="img-responsive"
src="/blog/images/parquet-pushdown/new-pipeline.jpg" width="80%"/>
+<figcaption>
+ New decoding pipeline, building filter mask and output columns are
interleaved in a single pass, allowing us to cache minimal pages for minimal
amount of time
+ </figcaption>
+</figure>
+<p>The new pipeline interleaves the previous two phases into a single
pass, so that:</p>
+<ol>
+<li>
+<p>The page being decompressed is immediately used to build filter masks
and output columns.</p>
+</li>
+<li>
+<p>Decompressed pages are cached for minimal time; after one pass (steps
1-6), the cache memory is released for the next pass. </p>
+</li>
+</ol>
+<p>This allows the cache to only hold 1 page at a time, and to
immediately discard the previous page after it's used, significantly reducing
the memory requirement for caching.</p>
+<h3>What pages are cached?</h3>
+<p>You may have noticed that only <code>location</code> is
cached, not <code>val</code>, because <code>val</code>
is only used for output.
+More generally, only columns that appear both in the filter and output are
cached, and at most 1 page is cached for each such column.</p>
+<p>More examples:</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span class="n">val</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, no columns are cached, because
<code>val</code> is not used for filtering.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span
class="k">COUNT</span><span class="p">(</span><span
class="o">*</span><span class="p">)</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, again, no columns are cached, because the output
projection is empty after query plan optimization.</p>
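<p>A minimal sketch of the caching rule above: a column is cached only if it appears in both the filter columns and the output projection, i.e., a set intersection. The helper below is hypothetical, not the actual arrow-rs code:</p>

```rust
use std::collections::HashSet;

/// Hypothetical helper: a column is cached only if it appears in both
/// the filter columns and the output projection.
fn cached_columns<'a>(filter: &[&'a str], output: &[&'a str]) -> Vec<&'a str> {
    let out: HashSet<&str> = output.iter().copied().collect();
    filter.iter().copied().filter(|c| out.contains(c)).collect()
}

fn main() {
    // SELECT val, location ... WHERE date_time > ... AND location = ...
    assert_eq!(
        cached_columns(&["date_time", "location"], &["val", "location"]),
        vec!["location"]
    );
    // SELECT val ...: no overlap between filter and output, nothing cached.
    assert!(cached_columns(&["date_time", "location"], &["val"]).is_empty());
    // SELECT COUNT(*): empty output projection, nothing cached.
    assert!(cached_columns(&["date_time", "location"], &[]).is_empty());
}
```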
+<h3>Then why cache 2 pages per column instead of 1?</h3>
+<p>This is another real-world nuance in how Parquet lays out the
pages.</p>
+<p>Parquet by default encodes data using <a
href="https://parquet.apache.org/docs/file-format/data-pages/encodings/">dictionary
encoding</a>, which writes a dictionary page as the first page of a
column chunk, followed by the keys referencing the dictionary.</p>
+<p>You can see this in action using <a
href="https://parquet-viewer.xiangpeng.systems">parquet-viewer</a>:</p>
+<figure>
+<img alt="Parquet viewer shows the page layout of a column chunk"
class="img-responsive" src="/blog/images/parquet-pushdown/parquet-viewer.jpg"
width="80%"/>
+<figcaption>
+ Parquet viewer shows the page layout of a column chunk
+ </figcaption>
+</figure>
+<p>This means that decoding a single data page actually touches two
pages: the dictionary page and the data page itself.</p>
+<p>This is why it caches 2 pages per column: one dictionary page and one
data page.
+The data page slot will move forward as it reads the data; but the dictionary
page slot always references the first page.</p>
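<p>A minimal sketch of this two-slot scheme, assuming a simplified <code>Page</code> type (the real arrow-rs structures differ): the dictionary slot is pinned while the data slot slides forward.</p>

```rust
/// Hypothetical stand-in for a decompressed Parquet page; not the
/// actual arrow-rs data structure.
#[derive(Debug)]
struct Page {
    offset: usize,
    bytes: Vec<u8>,
}

/// Per-column cache: one pinned slot for the dictionary page and one
/// sliding slot for the current data page, so at most two pages per
/// column are held at any time.
#[derive(Default)]
struct ColumnPageCache {
    dictionary: Option<Page>, // pinned: always the first page of the chunk
    data: Option<Page>,       // moves forward as the reader advances
}

impl ColumnPageCache {
    /// The dictionary slot is set once and never replaced.
    fn set_dictionary(&mut self, page: Page) {
        self.dictionary.get_or_insert(page);
    }

    /// The data slot is replaced on every new page, releasing the old one.
    fn advance_data(&mut self, page: Page) {
        self.data = Some(page);
    }
}

fn main() {
    let mut cache = ColumnPageCache::default();
    cache.set_dictionary(Page { offset: 0, bytes: vec![1, 2] });
    cache.advance_data(Page { offset: 100, bytes: vec![3] });
    cache.advance_data(Page { offset: 200, bytes: vec![4] });
    // Dictionary stays pinned; the data slot holds only the latest page.
    assert_eq!(cache.dictionary.as_ref().unwrap().offset, 0);
    assert_eq!(cache.data.as_ref().unwrap().offset, 200);
}
```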
+<figure>
+<img alt="Cached two pages, one for dictionary (pinned), one for data
(moves as it reads the data)" class="img-responsive"
src="/blog/images/parquet-pushdown/cached-pages.jpg" width="80%"/>
+<figcaption>
+ Cached two pages, one for dictionary (pinned), one for data (moves as it
reads the data)
+ </figcaption>
+</figure>
+<h2>How does it perform?</h2>
+<p>Here are my results on <a
href="https://github.com/apache/datafusion/tree/main/benchmarks#clickbench">ClickBench</a>
on my AMD 9900X machine. The total time is reduced by 15%, with Q23 being
2.24x faster;
+the queries that get slower are likely within measurement noise.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="err">┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓</span>
+<span class="err">┃ Query ┃ no-pushdown ┃ new-pushdown ┃
Change ┃</span>
+<span
class="err">┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩</span>
+<span class="err">&boxv; QQuery 0 &boxv; 0.47ms
&boxv; 0.43ms &boxv; +1.10x faster &boxv;</span>
+<span class="err">&boxv; QQuery 1 &boxv; 51.10ms
&boxv; 50.10ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 2 &boxv; 68.23ms
&boxv; 64.49ms &boxv; +1.06x faster &boxv;</span>
+<span class="err">&boxv; QQuery 3 &boxv; 90.68ms
&boxv; 86.73ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 4 &boxv; 458.93ms
&boxv; 458.59ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 5 &boxv; 522.06ms
&boxv; 478.50ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 6 &boxv; 49.84ms
&boxv; 49.94ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 7 &boxv; 55.09ms
&boxv; 55.77ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 8 &boxv; 565.26ms
&boxv; 556.95ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 9 &boxv; 575.83ms
&boxv; 575.05ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 10 &boxv; 164.56ms
&boxv; 178.23ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 11 &boxv; 177.20ms
&boxv; 191.32ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 12 &boxv; 591.05ms
&boxv; 569.92ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 13 &boxv; 861.06ms
&boxv; 848.59ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 14 &boxv; 596.20ms
&boxv; 580.73ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 15 &boxv; 554.96ms
&boxv; 548.77ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 16 &boxv; 1175.08ms
&boxv; 1146.07ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 17 &boxv; 1150.45ms
&boxv; 1121.49ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 18 &boxv; 2634.75ms
&boxv; 2494.07ms &boxv; +1.06x faster &boxv;</span>
+<span class="err">&boxv; QQuery 19 &boxv; 90.15ms
&boxv; 89.24ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 20 &boxv; 620.15ms
&boxv; 591.67ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 21 &boxv; 782.38ms
&boxv; 703.15ms &boxv; +1.11x faster &boxv;</span>
+<span class="err">&boxv; QQuery 22 &boxv; 1927.94ms
&boxv; 1404.35ms &boxv; +1.37x faster &boxv;</span>
+<span class="err">&boxv; QQuery 23 &boxv; 8104.11ms
&boxv; 3610.76ms &boxv; +2.24x faster &boxv;</span>
+<span class="err">&boxv; QQuery 24 &boxv; 360.79ms
&boxv; 330.55ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 25 &boxv; 290.61ms
&boxv; 252.54ms &boxv; +1.15x faster &boxv;</span>
+<span class="err">&boxv; QQuery 26 &boxv; 395.18ms
&boxv; 362.72ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 27 &boxv; 891.76ms
&boxv; 959.39ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 28 &boxv; 4059.54ms
&boxv; 4137.37ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 29 &boxv; 235.88ms
&boxv; 228.99ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 30 &boxv; 564.22ms
&boxv; 584.65ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 31 &boxv; 741.20ms
&boxv; 757.87ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 32 &boxv; 2652.48ms
&boxv; 2574.19ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 33 &boxv; 2373.71ms
&boxv; 2327.10ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 34 &boxv; 2391.00ms
&boxv; 2342.15ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 35 &boxv; 700.79ms
&boxv; 694.51ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 36 &boxv; 151.51ms
&boxv; 152.93ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 37 &boxv; 108.18ms
&boxv; 86.03ms &boxv; +1.26x faster &boxv;</span>
+<span class="err">&boxv; QQuery 38 &boxv; 114.64ms
&boxv; 106.22ms &boxv; +1.08x faster &boxv;</span>
+<span class="err">&boxv; QQuery 39 &boxv; 260.80ms
&boxv; 239.13ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 40 &boxv; 60.74ms
&boxv; 73.29ms &boxv; 1.21x slower &boxv;</span>
+<span class="err">&boxv; QQuery 41 &boxv; 58.75ms
&boxv; 67.85ms &boxv; 1.15x slower &boxv;</span>
+<span class="err">&boxv; QQuery 42 &boxv; 65.49ms
&boxv; 68.11ms &boxv; no change &boxv;</span>
+<span
class="err">&boxur;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&
[...]
+<span
class="err">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓</span>
+<span class="err">┃ Benchmark Summary ┃
┃</span>
+<span
class="err">┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩</span>
+<span class="err">&boxv; Total Time (no-pushdown) &boxv;
38344.79ms &boxv;</span>
+<span class="err">&boxv; Total Time (new-pushdown) &boxv;
32800.50ms &boxv;</span>
+<span class="err">&boxv; Average Time (no-pushdown) &boxv;
891.74ms &boxv;</span>
+<span class="err">&boxv; Average Time (new-pushdown) &boxv;
762.80ms &boxv;</span>
+<span class="err">&boxv; Queries Faster &boxv;
13 &boxv;</span>
+<span class="err">&boxv; Queries Slower &boxv;
5 &boxv;</span>
+<span class="err">&boxv; Queries with No Change &boxv;
25 &boxv;</span>
+<span
class="err">&boxur;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxul;</span>
+</code></pre></div>
+<h2>Conclusion</h2>
+<p>Despite being simple in theory, filter pushdown in Parquet is
non-trivial to implement.
+It requires understanding both the Parquet format and reader implementation
details.
+The challenge lies in efficiently navigating through the dynamics of decoding,
filter evaluation, and memory management.</p>
+<p>If you are interested in this level of optimization and want to help
test, document and implement this type of optimization, come find us in the
<a
href="https://datafusion.apache.org/contributor-guide/communication.html">DataFusion
Community</a>. We would love to have you. </p></content><category
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.7.0
Release</title><link
href="https://datafusion.apache.org/blog/2025/03/20/datafusion-comet-0 [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/xiangpeng-hao.atom.xml
b/output/feeds/xiangpeng-hao.atom.xml
index 8d6ba2f..0b818dd 100644
--- a/output/feeds/xiangpeng-hao.atom.xml
+++ b/output/feeds/xiangpeng-hao.atom.xml
@@ -1,5 +1,293 @@
<?xml version="1.0" encoding="utf-8"?>
-<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
Xiangpeng Hao</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/xiangpeng-hao.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-20T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Parquet
Pruning in DataFusion: Read Only What Matters</title><link
href="https://datafusion.apache.org/blog [...]
+<feed xmlns="http://www.w3.org/2005/Atom"><title>Apache DataFusion Blog -
Xiangpeng Hao</title><link href="https://datafusion.apache.org/blog/"
rel="alternate"></link><link
href="https://datafusion.apache.org/blog/feeds/xiangpeng-hao.atom.xml"
rel="self"></link><id>https://datafusion.apache.org/blog/</id><updated>2025-03-21T00:00:00+00:00</updated><subtitle></subtitle><entry><title>Efficient
Filter Pushdown in Parquet</title><link
href="https://datafusion.apache.org/blog/2025/03/21/parqu [...]
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></summary><content type="html"><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post</a>, we discussed how <a
href="https://datafusion.apache.org/">Apache DataFusion</a> prunes
<a href="https://parquet.apache.org/">Apache Parquet</a> files to
skip irrelevant <strong>files/row_groups</strong> (sometimes also
<a
href="https://parquet.apache.org/docs/file-format/pageindex/">pages</a>).</p>
+<p>This post discusses how Parquet readers skip irrelevant
<strong>rows</strong> while scanning data,
+leveraging Parquet's columnar layout by first reading only filter columns,
+and then selectively reading other columns only for matching rows.</p>
+<h2>Why filter pushdown in Parquet?</h2>
+<p>Below is an example query that reads sensor data with filters on
<code>date_time</code> and <code>location</code>.
+Without filter pushdown, all rows from <code>location</code>,
<code>val</code>, and <code>date_time</code> columns
are decoded before <code>location='office'</code> is evaluated.
Filter pushdown is especially useful when the filter is selective, i.e.,
removes many rows.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span class="n">val</span><span
class="p">,</span> <span class="k">location</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<figure>
+<img alt="Parquet pruning skips irrelevant files/row_groups, while filter
pushdown skips irrelevant rows. Without filter pushdown, all rows from
location, val, and date_time columns are decoded before `location='office'` is
evaluated. Filter pushdown is especially useful when the filter is selective,
i.e., removes many rows." class="img-responsive"
src="/blog/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg" width="80%"/>
+<figcaption>
+ Parquet pruning skips irrelevant files/row_groups, while filter pushdown
skips irrelevant rows. Without filter pushdown, all rows from location, val,
and date_time columns are decoded before `location='office'` is evaluated.
Filter pushdown is especially useful when the filter is selective, i.e.,
removes many rows.
+ </figcaption>
+</figure>
+<p>In our setup, sensor data is aggregated by date &mdash; each day
has its own Parquet file.
+At planning time, DataFusion prunes the unneeded Parquet files, i.e.,
<code>2025-03-10.parquet</code> and
<code>2025-03-11.parquet</code>.</p>
+<p>Once the files to read are located, <a
href="https://github.com/apache/datafusion/issues/3463"><em>DataFusion's
current default implementation</em></a> reads all the projected
columns (<code>sensor_id</code>, <code>val</code>, and
<code>location</code>) into Arrow RecordBatches, then applies the
filters over <code>location</code> to get the final set of
rows.</p>
+<p>A better approach is called <strong>filter
pushdown</strong> with <strong>late materialization</strong>,
which evaluates filter conditions first and only decodes data that passes these
conditions.
+In practice, this works by first processing only the filter columns
(<code>date_time</code> and <code>location</code>),
building a boolean mask of rows that satisfy our conditions, then using this
mask to selectively decode only the relevant rows from other columns
(<code>sensor_id</code>, <code>val</code>).
+This eliminates the waste of decoding rows that will be immediately filtered
out.</p>
+<p>While simple in theory, practical implementations often make
performance worse.</p>
+<h2>How can filter pushdown be slower?</h2>
+<p>At a high level, the Parquet reader first builds a filter mask --
essentially a boolean array indicating which rows meet the filter criteria --
and then uses this mask to selectively decode only the needed rows from the
remaining columns in the projection.</p>
+<p>Let's dig into the details of <a
href="https://github.com/apache/arrow-rs/blob/d5339f31a60a4bd8a4256e7120fe32603249d88e/parquet/src/arrow/async_reader/mod.rs#L618-L712">how
filter pushdown is implemented</a> in the current Rust Parquet
reader, illustrated in the following figure.</p>
+<figure>
+<img alt="Implementation of filter pushdown in Rust Parquet readers"
class="img-responsive" src="/blog/images/parquet-pushdown/baseline-impl.jpg"
width="70%"/>
+<figcaption>
+ Implementation of filter pushdown in Rust Parquet readers -- the first
phase builds the filter mask, the second phase applies the filter mask to the
other columns
+ </figcaption>
+</figure>
+<p>The filter pushdown has two phases:</p>
+<ol>
+<li>
+<p>Build the filter mask (steps 1-3)</p>
+</li>
+<li>
+<p>Use the filter mask to selectively decode the other columns (steps 4-7);
e.g., the output of step 3 is used as input for steps 5 and 7.</p>
+</li>
+</ol>
+<p>Within each phase, going from Parquet to Arrow takes three
steps:</p>
+<ol>
+<li>
+<p>Decompress the Parquet pages using generic decompression algorithms
like LZ4, Zstd, etc. (steps 1, 4, 6)</p>
+</li>
+<li>
+<p>Decode the page content into Arrow format (steps 2, 5, 7)</p>
+</li>
+<li>
+<p>Evaluate the filter over Arrow data (step 3)</p>
+</li>
+</ol>
+<p>In the figure above, we can see that
<code>location</code> is <strong>decompressed and decoded
twice</strong>, first when building the filter mask (steps 1, 2), and
second when building the output (steps 4, 5).
+This happens for all columns that appear both in the filter and
output.</p>
+<p>The table below shows the corresponding CPU time on the <a
href="https://github.com/apache/datafusion/blob/main/benchmarks/queries/clickbench/queries.sql#L23">ClickBench
query 22</a>:</p>
+<div
class="codehilite"><pre><span></span><code><span
class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| Decompress | Decode | Apply filter| Others
|</span>
+<span
class="err">+------------+--------+-------------+--------+</span>
+<span class="err">| 206 ms | 117 ms | 22 ms | 48 ms
|</span>
+<span
class="err">+------------+--------+-------------+--------+</span>
+</code></pre></div>
+<p>Clearly, decompress/decode operations dominate the time spent. With
filter pushdown, the reader decompresses and decodes the filter columns
twice; without filter pushdown, it does so only once.
+This explains why filter pushdown can be slower in some cases.</p>
+<blockquote>
+<p><strong>Note:</strong> Highly selective filters may skip
entire pages; but as long as the reader needs even one row from a page, it
must decompress (and often decode) the entire page.</p>
+</blockquote>
+<h2>Attempt: cache filter columns</h2>
+<p>Intuitively, caching the filter columns and reusing them later could
help.</p>
+<p>But naively caching decoded pages consumes prohibitively high
memory:</p>
+<ol>
+<li>
+<p>It needs to cache Arrow arrays, which are on average <a
href="https://github.com/XiangpengHao/liquid-cache/blob/main/dev/doc/liquid-cache-vldb.pdf">4x
larger than Parquet data</a>.</p>
+</li>
+<li>
+<p>It needs to cache the <strong>entire column chunk in
memory</strong>, because in Phase 1 it builds filters over the column
chunk, and only uses it in Phase 2.</p>
+</li>
+<li>
+<p>The memory usage is proportional to the number of filter columns,
which can be unboundedly high. </p>
+</li>
+</ol>
+<p>Worse, caching filter columns means it needs to read partially from
Parquet and partially from cache, which is complex to implement, likely
requiring a substantial change to the current implementation. </p>
+<blockquote>
+<p><strong>Feel the complexity:</strong> consider building a
cache that properly handles nested columns, multiple filters, and filters with
multiple columns.</p>
+</blockquote>
+<h2>Real solution</h2>
+<p>We need a solution that:</p>
+<ol>
+<li>
+<p>Is simple to implement, i.e., doesn't require thousands of lines of
code.</p>
+</li>
+<li>
+<p>Incurs minimal memory overhead.</p>
+</li>
+</ol>
+<p>This section describes my <a
href="https://github.com/apache/arrow-rs/pull/6921#issuecomment-2718792433">&lt;700
LOC PR (with lots of comments and tests)</a> that <strong>reduces
total ClickBench time by 15%, with up to 2x lower latency for some queries, no
obvious regression on other queries, and caches at most 2 pages (~2MB) per
column in memory</strong>.</p>
+<figure>
+<img alt="New decoding pipeline, building filter mask and output columns
are interleaved in a single pass, allowing us to cache minimal pages for
minimal amount of time" class="img-responsive"
src="/blog/images/parquet-pushdown/new-pipeline.jpg" width="80%"/>
+<figcaption>
+ New decoding pipeline, building filter mask and output columns are
interleaved in a single pass, allowing us to cache minimal pages for minimal
amount of time
+ </figcaption>
+</figure>
+<p>The new pipeline interleaves the previous two phases into a single
pass, so that:</p>
+<ol>
+<li>
+<p>The page being decompressed is immediately used to build filter masks
and output columns.</p>
+</li>
+<li>
+<p>Decompressed pages are cached for minimal time; after one pass (steps
1-6), the cache memory is released for the next pass. </p>
+</li>
+</ol>
+<p>This allows the cache to only hold 1 page at a time, and to
immediately discard the previous page after it's used, significantly reducing
the memory requirement for caching.</p>
+<h3>What pages are cached?</h3>
+<p>You may have noticed that only <code>location</code> is
cached, not <code>val</code>, because <code>val</code>
is only used for output.
+More generally, only columns that appear both in the filter and output are
cached, and at most 1 page is cached for each such column.</p>
+<p>More examples:</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span class="n">val</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, no columns are cached, because
<code>val</code> is not used for filtering.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="k">SELECT</span> <span
class="k">COUNT</span><span class="p">(</span><span
class="o">*</span><span class="p">)</span>
+<span class="k">FROM</span> <span
class="n">sensor_data</span>
+<span class="k">WHERE</span> <span
class="n">date_time</span> <span class="o">&gt;</span>
<span class="s1">'2025-03-11'</span> <span
class="k">AND</span> <span class="k">location</span>
<span class="o">=</span> <span
class="s1">'office'</span><span class="p">;</span>
+</code></pre></div>
+<p>In this case, again, no columns are cached, because the output
projection is empty after query plan optimization.</p>
+<h3>Then why cache 2 pages per column instead of 1?</h3>
+<p>This is another real-world nuance in how Parquet lays out the
pages.</p>
+<p>Parquet by default encodes data using <a
href="https://parquet.apache.org/docs/file-format/data-pages/encodings/">dictionary
encoding</a>, which writes a dictionary page as the first page of a
column chunk, followed by the keys referencing the dictionary.</p>
+<p>You can see this in action using <a
href="https://parquet-viewer.xiangpeng.systems">parquet-viewer</a>:</p>
+<figure>
+<img alt="Parquet viewer shows the page layout of a column chunk"
class="img-responsive" src="/blog/images/parquet-pushdown/parquet-viewer.jpg"
width="80%"/>
+<figcaption>
+ Parquet viewer shows the page layout of a column chunk
+ </figcaption>
+</figure>
+<p>This means that decoding a single data page actually touches two
pages: the dictionary page and the data page itself.</p>
+<p>This is why it caches 2 pages per column: one dictionary page and one
data page.
+The data page slot will move forward as it reads the data; but the dictionary
page slot always references the first page.</p>
+<figure>
+<img alt="Cached two pages, one for dictionary (pinned), one for data
(moves as it reads the data)" class="img-responsive"
src="/blog/images/parquet-pushdown/cached-pages.jpg" width="80%"/>
+<figcaption>
+ Cached two pages, one for dictionary (pinned), one for data (moves as it
reads the data)
+ </figcaption>
+</figure>
+<h2>How does it perform?</h2>
+<p>Here are my results on <a
href="https://github.com/apache/datafusion/tree/main/benchmarks#clickbench">ClickBench</a>
on my AMD 9900X machine. The total time is reduced by 15%, with Q23 being
2.24x faster;
+the queries that get slower are likely within measurement noise.</p>
+<div
class="codehilite"><pre><span></span><code><span
class="err">┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓</span>
+<span class="err">┃ Query ┃ no-pushdown ┃ new-pushdown ┃
Change ┃</span>
+<span
class="err">┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩</span>
+<span class="err">&boxv; QQuery 0 &boxv; 0.47ms
&boxv; 0.43ms &boxv; +1.10x faster &boxv;</span>
+<span class="err">&boxv; QQuery 1 &boxv; 51.10ms
&boxv; 50.10ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 2 &boxv; 68.23ms
&boxv; 64.49ms &boxv; +1.06x faster &boxv;</span>
+<span class="err">&boxv; QQuery 3 &boxv; 90.68ms
&boxv; 86.73ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 4 &boxv; 458.93ms
&boxv; 458.59ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 5 &boxv; 522.06ms
&boxv; 478.50ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 6 &boxv; 49.84ms
&boxv; 49.94ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 7 &boxv; 55.09ms
&boxv; 55.77ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 8 &boxv; 565.26ms
&boxv; 556.95ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 9 &boxv; 575.83ms
&boxv; 575.05ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 10 &boxv; 164.56ms
&boxv; 178.23ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 11 &boxv; 177.20ms
&boxv; 191.32ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 12 &boxv; 591.05ms
&boxv; 569.92ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 13 &boxv; 861.06ms
&boxv; 848.59ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 14 &boxv; 596.20ms
&boxv; 580.73ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 15 &boxv; 554.96ms
&boxv; 548.77ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 16 &boxv; 1175.08ms
&boxv; 1146.07ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 17 &boxv; 1150.45ms
&boxv; 1121.49ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 18 &boxv; 2634.75ms
&boxv; 2494.07ms &boxv; +1.06x faster &boxv;</span>
+<span class="err">&boxv; QQuery 19 &boxv; 90.15ms
&boxv; 89.24ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 20 &boxv; 620.15ms
&boxv; 591.67ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 21 &boxv; 782.38ms
&boxv; 703.15ms &boxv; +1.11x faster &boxv;</span>
+<span class="err">&boxv; QQuery 22 &boxv; 1927.94ms
&boxv; 1404.35ms &boxv; +1.37x faster &boxv;</span>
+<span class="err">&boxv; QQuery 23 &boxv; 8104.11ms
&boxv; 3610.76ms &boxv; +2.24x faster &boxv;</span>
+<span class="err">&boxv; QQuery 24 &boxv; 360.79ms
&boxv; 330.55ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 25 &boxv; 290.61ms
&boxv; 252.54ms &boxv; +1.15x faster &boxv;</span>
+<span class="err">&boxv; QQuery 26 &boxv; 395.18ms
&boxv; 362.72ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 27 &boxv; 891.76ms
&boxv; 959.39ms &boxv; 1.08x slower &boxv;</span>
+<span class="err">&boxv; QQuery 28 &boxv; 4059.54ms
&boxv; 4137.37ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 29 &boxv; 235.88ms
&boxv; 228.99ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 30 &boxv; 564.22ms
&boxv; 584.65ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 31 &boxv; 741.20ms
&boxv; 757.87ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 32 &boxv; 2652.48ms
&boxv; 2574.19ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 33 &boxv; 2373.71ms
&boxv; 2327.10ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 34 &boxv; 2391.00ms
&boxv; 2342.15ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 35 &boxv; 700.79ms
&boxv; 694.51ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 36 &boxv; 151.51ms
&boxv; 152.93ms &boxv; no change &boxv;</span>
+<span class="err">&boxv; QQuery 37 &boxv; 108.18ms
&boxv; 86.03ms &boxv; +1.26x faster &boxv;</span>
+<span class="err">&boxv; QQuery 38 &boxv; 114.64ms
&boxv; 106.22ms &boxv; +1.08x faster &boxv;</span>
+<span class="err">&boxv; QQuery 39 &boxv; 260.80ms
&boxv; 239.13ms &boxv; +1.09x faster &boxv;</span>
+<span class="err">&boxv; QQuery 40 &boxv; 60.74ms
&boxv; 73.29ms &boxv; 1.21x slower &boxv;</span>
+<span class="err">&boxv; QQuery 41 &boxv; 58.75ms
&boxv; 67.85ms &boxv; 1.15x slower &boxv;</span>
+<span class="err">&boxv; QQuery 42 &boxv; 65.49ms
&boxv; 68.11ms &boxv; no change &boxv;</span>
+<span
class="err">&boxur;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&
[...]
+<span
class="err">┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓</span>
+<span class="err">┃ Benchmark Summary ┃
┃</span>
+<span
class="err">┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩</span>
+<span class="err">&boxv; Total Time (no-pushdown) &boxv;
38344.79ms &boxv;</span>
+<span class="err">&boxv; Total Time (new-pushdown) &boxv;
32800.50ms &boxv;</span>
+<span class="err">&boxv; Average Time (no-pushdown) &boxv;
891.74ms &boxv;</span>
+<span class="err">&boxv; Average Time (new-pushdown) &boxv;
762.80ms &boxv;</span>
+<span class="err">&boxv; Queries Faster &boxv;
13 &boxv;</span>
+<span class="err">&boxv; Queries Slower &boxv;
5 &boxv;</span>
+<span class="err">&boxv; Queries with No Change &boxv;
25 &boxv;</span>
+<span
class="err">&boxur;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxhu;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxh;&boxul;</span>
+</code></pre></div>
+<h2>Conclusion</h2>
+<p>Although filter pushdown is conceptually simple, it is non-trivial to
implement in Parquet.
+It requires a detailed understanding of both the Parquet format and the reader
implementation.
+The challenge lies in efficiently coordinating decoding, filter evaluation,
and memory management.</p>
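The interplay described above can be sketched in miniature: evaluate the predicate on the filter column first, build a row selection, and only then materialize the remaining columns for the surviving rows. This is a hypothetical, illustrative Python sketch of the general idea, not the DataFusion/parquet-rs implementation; the function and parameter names are invented for this example.

```python
def pushdown_scan(filter_col, payload_cols, predicate):
    """Return only the rows where predicate(filter value) holds.

    filter_col:   list of values for the predicate column
    payload_cols: dict of column name -> list of values
    predicate:    callable evaluated on the filter column only
    """
    # Step 1: evaluate the predicate on the (cheaply decoded) filter column.
    selection = [predicate(v) for v in filter_col]

    # Step 2: "decode" payload columns only for selected rows, avoiding
    # wasted work on rows the filter already rejected.
    selected_rows = [i for i, keep in enumerate(selection) if keep]
    return {
        name: [col[i] for i in selected_rows]
        for name, col in payload_cols.items()
    }


result = pushdown_scan(
    filter_col=[1, 5, 3, 9],
    payload_cols={"name": ["a", "b", "c", "d"]},
    predicate=lambda v: v > 2,
)
# result == {"name": ["b", "c", "d"]}
```

The real reader must additionally manage page-level decoding, caching of already-decoded filter columns, and selection representations, which is where most of the implementation complexity lives.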
+<p>If you are interested in this level of optimization and want to help
test, document, and implement this type of optimization, come find us in the
<a
href="https://datafusion.apache.org/contributor-guide/communication.html">DataFusion
Community</a>. We would love to have you. </p></content><category
term="blog"></category></entry><entry><title>Parquet Pruning in DataFusion:
Read Only What Matters</title><link
href="https://datafusion.apache.org/blog/2025/03/20/pa [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/feeds/xiangpeng-hao.rss.xml
b/output/feeds/xiangpeng-hao.rss.xml
index d48cdfa..bfe62fb 100644
--- a/output/feeds/xiangpeng-hao.rss.xml
+++ b/output/feeds/xiangpeng-hao.rss.xml
@@ -1,5 +1,43 @@
<?xml version="1.0" encoding="utf-8"?>
-<rss version="2.0"><channel><title>Apache DataFusion Blog - Xiangpeng
Hao</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Thu,
20 Mar 2025 00:00:00 +0000</lastBuildDate><item><title>Parquet Pruning in
DataFusion: Read Only What
Matters</title><link>https://datafusion.apache.org/blog/2025/03/20/parquet-pruning</link><description><!--
+<rss version="2.0"><channel><title>Apache DataFusion Blog - Xiangpeng
Hao</title><link>https://datafusion.apache.org/blog/</link><description></description><lastBuildDate>Fri,
21 Mar 2025 00:00:00 +0000</lastBuildDate><item><title>Efficient Filter
Pushdown in
Parquet</title><link>https://datafusion.apache.org/blog/2025/03/21/parquet-pushdown</link><description><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng
Hao's blog</a>. Thanks to <a
href="https://www.influxdata.com/">InfluxData</a> for sponsoring this
work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></description><dc:creator
xmlns:dc="http://purl.org/dc/elements/1.1/">Xiangpeng
Hao</dc:creator><pubDate>Fri, 21 Mar 2025 00:00:00 +0000</pubDate><guid
isPermaLink="false">tag:datafusion.apache.org,2025-03-21:/blog/2025/03/21/parquet-pushdown</guid><category>blog</category></item><item><title>Parquet
Pruning in DataFusion: Read Only What Matters</title><link>ht [...]
{% comment %}
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
diff --git a/output/images/parquet-pushdown/baseline-impl.jpg
b/output/images/parquet-pushdown/baseline-impl.jpg
new file mode 100644
index 0000000..28f7e59
Binary files /dev/null and b/output/images/parquet-pushdown/baseline-impl.jpg
differ
diff --git a/output/images/parquet-pushdown/cached-pages.jpg
b/output/images/parquet-pushdown/cached-pages.jpg
new file mode 100644
index 0000000..0949544
Binary files /dev/null and b/output/images/parquet-pushdown/cached-pages.jpg
differ
diff --git a/output/images/parquet-pushdown/new-pipeline.jpg
b/output/images/parquet-pushdown/new-pipeline.jpg
new file mode 100644
index 0000000..93db49b
Binary files /dev/null and b/output/images/parquet-pushdown/new-pipeline.jpg
differ
diff --git a/output/images/parquet-pushdown/parquet-viewer.jpg
b/output/images/parquet-pushdown/parquet-viewer.jpg
new file mode 100644
index 0000000..71de38d
Binary files /dev/null and b/output/images/parquet-pushdown/parquet-viewer.jpg
differ
diff --git a/output/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg
b/output/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg
new file mode 100644
index 0000000..bedcf10
Binary files /dev/null and
b/output/images/parquet-pushdown/pushdown-vs-no-pushdown.jpg differ
diff --git a/output/index.html b/output/index.html
index 9fecc28..ce50f7d 100644
--- a/output/index.html
+++ b/output/index.html
@@ -44,6 +44,63 @@
<p><i>Here you can find the latest updates from DataFusion and
related projects.</i></p>
+ <!-- Post -->
+ <div class="row">
+ <div class="callout">
+ <article class="post">
+ <header>
+ <div class="title">
+ <h1><a
href="/blog/2025/03/21/parquet-pushdown">Efficient Filter Pushdown in
Parquet</a></h1>
+ <p>Posted on: Fri 21 March 2025 by Xiangpeng Hao</p>
+ <p><style>
+figure {
+ margin: 20px 0;
+}
+
+figure img {
+ display: block;
+ max-width: 80%;
+}
+
+figcaption {
+ font-style: italic;
+ margin-top: 10px;
+ color: #555;
+ font-size: 0.9em;
+ max-width: 80%;
+}
+</style>
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+<p><em>Editor's Note: This blog was first published on <a
href="https://blog.xiangpeng.systems/posts/parquet-pushdown/">Xiangpeng Hao's
blog</a>. Thanks to <a href="https://www.influxdata.com/">InfluxData</a> for
sponsoring this work as part of his PhD funding.</em></p>
+<hr/>
+<p>In the <a
href="https://datafusion.apache.org/blog/2025/03/20/parquet-pruning">previous
post …</a></p></p>
+ <footer>
+ <ul class="actions">
+ <div style="text-align: right"><a
href="/blog/2025/03/21/parquet-pushdown" class="button medium">Continue
Reading</a></div>
+ </ul>
+ <ul class="stats">
+ </ul>
+ </footer>
+ </article>
+ </div>
+ </div>
<!-- Post -->
<div class="row">
<div class="callout">
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]