(datafusion-site) branch asf-site updated: Publish Apache DataFusion is now the fastest single node engine for querying Apache Parquet files (#39)

alamb Thu, 21 Nov 2024 10:45:35 -0800

This is an automated email from the ASF dual-hosted git repository.

alamb pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 7aafb14  Publish Apache DataFusion is now the fastest single node 
engine for querying Apache Parquet files (#39)
7aafb14 is described below

commit 7aafb142744f5add2cd33d78b50d223277f9ab26
Author: Andrew Lamb <[email protected]>
AuthorDate: Thu Nov 21 13:20:20 2024 -0500

    Publish Apache DataFusion is now the fastest single node engine for 
querying Apache Parquet files (#39)
---
 .../index.html                                     | 336 +++++++++++++++++++++
 README.md                                          |   8 +-
 assets/main.css.map                                |   2 +-
 feed.xml                                           | 310 +++++++++++++++----
 .../column-based-storage.png                       | Bin 0 -> 47390 bytes
 img/clickbench-datafusion-43/perf-over-time.png    | Bin 0 -> 62185 bytes
 img/clickbench-datafusion-43/perf.png              | Bin 0 -> 57326 bytes
 img/clickbench-datafusion-43/row-based-storage.png | Bin 0 -> 45024 bytes
 .../skipping-partial-aggregation.png               | Bin 0 -> 133362 bytes
 img/clickbench-datafusion-43/string-view-take.png  | Bin 0 -> 66566 bytes
 index.html                                         |   5 +
 11 files changed, 604 insertions(+), 57 deletions(-)

diff --git 
a/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/index.html 
b/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/index.html
new file mode 100644
index 0000000..5340bd5
--- /dev/null
+++ b/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/index.html
@@ -0,0 +1,336 @@
+<!DOCTYPE html>
+<html lang="en"><head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1"><!-- 
Begin Jekyll SEO tag v2.8.0 -->
+<title>Apache DataFusion is now the fastest single node engine for querying 
Apache Parquet files | Apache DataFusion Project News &amp; Blog</title>
+<meta name="generator" content="Jekyll v4.3.3" />
+<meta property="og:title" content="Apache DataFusion is now the fastest single 
node engine for querying Apache Parquet files" />
+<meta name="author" content="Andrew Lamb, Staff Engineer at InfluxData" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="&lt;!–" />
+<meta property="og:description" content="&lt;!–" />
+<link rel="canonical" 
href="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/";
 />
+<meta property="og:url" 
content="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/";
 />
+<meta property="og:site_name" content="Apache DataFusion Project News &amp; 
Blog" />
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2024-11-18T00:00:00+00:00" />
+<meta name="twitter:card" content="summary" />
+<meta property="twitter:title" content="Apache DataFusion is now the fastest 
single node engine for querying Apache Parquet files" />
+<script type="application/ld+json">
+{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"Andrew
 Lamb, Staff Engineer at 
InfluxData"},"dateModified":"2024-11-18T00:00:00+00:00","datePublished":"2024-11-18T00:00:00+00:00","description":"&lt;!–","headline":"Apache
 DataFusion is now the fastest single node engine for querying Apache Parquet 
files","mainEntityOfPage":{"@type":"WebPage","@id":"https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"},";
 [...]
+<!-- End Jekyll SEO tag -->
+<link rel="stylesheet" href="/blog/assets/main.css"><link 
type="application/atom+xml" rel="alternate" 
href="https://datafusion.apache.org/blog/feed.xml"; title="Apache DataFusion 
Project News &amp; Blog" /></head>
+<body><header class="site-header" role="banner">
+
+  <div class="wrapper"><a class="site-title" rel="author" href="/blog/">Apache 
DataFusion Project News &amp; Blog</a><nav class="site-nav">
+        <input type="checkbox" id="nav-trigger" class="nav-trigger" />
+        <label for="nav-trigger">
+          <span class="menu-icon">
+            <svg viewBox="0 0 18 15" width="18px" height="15px">
+              <path 
d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0
 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z 
M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0 
c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z 
M18,13.516C18,14.335,17.335,15,16.516,15H1.484 
C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/>
+            </svg>
+          </span>
+        </label>
+
+        <div class="trigger"><a class="page-link" 
href="/blog/about/">About</a></div>
+      </nav></div>
+</header>
+<main class="page-content" aria-label="Content">
+      <div class="wrapper">
+        <article class="post h-entry" itemscope 
itemtype="http://schema.org/BlogPosting";>
+
+  <header class="post-header">
+    <h1 class="post-title p-name" itemprop="name headline">Apache DataFusion 
is now the fastest single node engine for querying Apache Parquet files</h1>
+    <p class="post-meta">
+      <time class="dt-published" datetime="2024-11-18T00:00:00+00:00" 
itemprop="datePublished">Nov 18, 2024
+      </time>• <span itemprop="author" itemscope 
itemtype="http://schema.org/Person";><span class="p-author h-card" 
itemprop="name">Andrew Lamb, Staff Engineer at InfluxData</span></span></p>
+  </header>
+
+  <div class="post-content e-content" itemprop="articleBody">
+    <!--
+
+-->
+
+<p>I am extremely excited to announce that <a href="https://crates.io/crates/datafusion">Apache DataFusion</a> is the
+fastest engine for querying Apache Parquet files in <a href="https://benchmark.clickhouse.com/">ClickBench</a>. It is faster
+than <a href="https://duckdb.org/">DuckDB</a>, <a href="https://clickhouse.com/chdb">chDB</a> and <a href="https://clickhouse.com/">ClickHouse</a> using the same hardware. It also marks
+the first time a <a href="https://www.rust-lang.org/">Rust</a>-based engine holds the top spot, which has previously
+been held by traditional C/C++-based engines.</p>
+
+<p><img src="/blog/img/2x_bgwhite_original.png" width="80%" 
class="img-responsive" alt="Apache DataFusion Logo" /></p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf.png" width="100%" 
class="img-responsive" alt="ClickBench performance for DataFusion 43.0.0" /></p>
+
+<p><strong>Figure 1</strong>: 2024-11-16 <a 
href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkFsbG95REIgKHR1bmVkKSI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsImNoREIgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6dHJ1ZSwiY2hEQiI6ZmFsc2UsIkNpdHVzIjpmYWxzZSwiQ2xpY2tIb3VzZSBDbG91ZCAoYXdzKSI6
 [...]
+partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a <code 
class="language-plaintext highlighter-rouge">c6a.4xlarge</code> (16
+CPU / 32 GB  RAM) VM. Measurements are relative (<code 
class="language-plaintext highlighter-rouge">1.x</code>) to results using
+different hardware.</p>
+
+<p>Best-in-class performance on Parquet is now available to anyone. DataFusion’s
+open design lets you start quickly with a full-featured query engine, including
+SQL, data formats, catalogs, and more, and then customize any behavior you need.
+I predict the continued emergence of new classes of data systems now that
+creators can focus the bulk of their innovation on areas such as query
+languages, system integrations, and data formats rather than trying to play
+catch-up with core engine performance.</p>
+
+<p>ClickBench also includes results for proprietary storage formats, which 
require
+costly load / export steps, making them useful in fewer use cases and thus much
+less important than open formats (though the idea of use case specific formats
+is interesting<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" 
class="footnote" rel="footnote">2</a></sup>).</p>
+
+<p>This blog post highlights some of the techniques we used to achieve this
+performance, and celebrates the teamwork involved.</p>
+
+<h1 id="a-strong-history-of-performance-improvements">A Strong History of 
Performance Improvements</h1>
+
+<p>Performance has long been a core focus for DataFusion’s community, and
+speed attracts users and contributors. Recently, we seem to have been
+even more focused on performance, including in July 2024 when <a href="https://www.linkedin.com/in/mehmet-ozan-kabak/">Mehmet Ozan
+Kabak</a>, CEO of <a href="https://www.synnada.ai/">Synnada</a>, again <a href="https://github.com/apache/datafusion/issues/11442#issuecomment-2226834443">suggested focusing on performance</a>. This
+got many of us excited (who doesn’t love a challenge!), and we have subsequently
+rallied to steadily improve performance release after release, as shown in
+Figure 2.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf-over-time.png" 
width="100%" class="img-responsive" alt="ClickBench performance results over 
time for DataFusion" /></p>
+
+<p><strong>Figure 2</strong>: ClickBench performance improved over 30% between 
DataFusion 34
+(released Dec. 2023) and DataFusion 43 (released Nov. 2024).</p>
+
+<p>Like all good optimization efforts, ours took sustained effort as DataFusion ran
+out of <a href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion">single 2x performance improvements</a> several years ago. Working together, our
+community of engineers from around the world<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> and all experience levels<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>
+pulled it off (check out <a href="https://github.com/apache/datafusion/issues/12821">this discussion</a> to get a sense). It may be a “<a href="https://db.cs.cmu.edu/seminar2024/">hobo
+sandwich</a>”<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup>, but it is a tasty one!</p>
+
+<p>Of course, most of these techniques have been implemented and described before,
+but until now they were only available in proprietary systems such as
+<a href="https://www.vertica.com/">Vertica</a>, <a href="https://www.databricks.com/product/photon">Databricks
+Photon</a>, or
+<a href="https://www.snowflake.com/en/">Snowflake</a>, or in tightly integrated open source
+systems such as <a href="https://duckdb.org/">DuckDB</a> or
+<a href="https://clickhouse.com/">ClickHouse</a>, which were not designed to be extended.</p>
+
+<h2 id="stringview">StringView</h2>
+
+<p>Performance improved for all queries when DataFusion switched to using Arrow
+<code class="language-plaintext highlighter-rouge">StringView</code>. Using 
<code class="language-plaintext highlighter-rouge">StringView</code> “just” 
saves some copies and avoids one memory
+access for certain comparisons. However, these copies and comparisons happen to
+occur in many of the hottest loops during query processing, so optimizing them
+resulted in measurable performance improvements.</p>
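+
+<p>To make the idea concrete, here is a minimal Rust sketch of view-based strings. It is an illustration only and not the arrow-rs implementation: <code class="language-plaintext highlighter-rouge">View</code>, <code class="language-plaintext highlighter-rouge">make_view</code>, <code class="language-plaintext highlighter-rouge">definitely_not_equal</code>, <code class="language-plaintext highlighter-rouge">bytes</code> and <code class="language-plaintext highlighter-rouge">take</code> are hypothetical names, the enum models the 16-byte Arrow view without packing it bit for bit, and a single shared data buffer is assumed (so the buffer index of the real layout is omitted).</p>
+
```rust
/// Simplified model of one string "view".
#[derive(Debug, Clone, Copy, PartialEq)]
enum View {
    /// Strings of at most 12 bytes live entirely inside the view (zero padded).
    Inline { len: u8, data: [u8; 12] },
    /// Longer strings keep a 4-byte prefix and an offset into the shared buffer.
    Buffered { len: u32, prefix: [u8; 4], offset: u32 },
}

/// Builds a view for `s`, appending to `buffer` only when `s` is longer than 12 bytes.
fn make_view(s: &str, buffer: &mut Vec<u8>) -> View {
    let raw = s.as_bytes();
    if raw.len() <= 12 {
        let mut data = [0u8; 12];
        data[..raw.len()].copy_from_slice(raw);
        View::Inline { len: raw.len() as u8, data }
    } else {
        let offset = buffer.len() as u32;
        buffer.extend_from_slice(raw);
        let mut prefix = [0u8; 4];
        prefix.copy_from_slice(&raw[..4]);
        View::Buffered { len: raw.len() as u32, prefix, offset }
    }
}

/// Returns true when the views alone prove the strings differ,
/// so the shared buffer never has to be touched.
fn definitely_not_equal(a: &View, b: &View) -> bool {
    match (a, b) {
        // Inline views hold the whole string, so comparing views is exact.
        (View::Inline { .. }, View::Inline { .. }) => a != b,
        (
            View::Buffered { len: la, prefix: pa, .. },
            View::Buffered { len: lb, prefix: pb, .. },
        ) => la != lb || pa != pb,
        // One string is at most 12 bytes and the other is longer.
        _ => true,
    }
}

/// Resolves a view to its bytes (the slow path that may read the buffer).
fn bytes<'a>(view: &'a View, buffer: &'a [u8]) -> &'a [u8] {
    match view {
        View::Inline { len, data } => &data[..*len as usize],
        View::Buffered { len, offset, .. } => {
            let start = *offset as usize;
            &buffer[start..start + *len as usize]
        }
    }
}

/// `take` copies only the fixed-size views; no string bytes are copied.
fn take(views: &[View], indices: &[usize]) -> Vec<View> {
    indices.iter().map(|&i| views[i]).collect()
}
```
+
+<p>In this sketch, selecting or reordering rows with <code class="language-plaintext highlighter-rouge">take</code> leaves the shared buffer untouched, and many unequal strings are rejected by comparing lengths and prefixes alone.</p>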
+
+<p><img src="/blog/img/clickbench-datafusion-43/string-view-take.png" 
width="80%" class="img-responsive" alt="Illustration of how take works with 
StringView" /></p>
+
+<p><strong>Figure 3:</strong> Figure from <a 
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/";>Using
 StringView / German Style Strings to Make
+Queries Faster: Part 1</a> showing how <code class="language-plaintext 
highlighter-rouge">StringView</code> saves copying data in many cases.</p>
+
+<p>Using StringView to make DataFusion faster for ClickBench required substantial,
+careful, low-level optimization work described in <a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/">Using StringView / German
+Style Strings to Make Queries Faster: Part 1</a> and <a href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/">Part 2</a>. However, it <em>also</em>
+required extending the rest of DataFusion’s operations to support the new type.
+You can get a sense of the magnitude of the work required by looking at the 100+
+pull requests linked to the epic in arrow-rs
+(<a href="https://github.com/apache/arrow-rs/issues/5374">here</a>) and three major epics
+(<a href="https://github.com/apache/datafusion/issues/10918">here</a>,
+<a href="https://github.com/apache/datafusion/issues/11790">here</a> and
+<a href="https://github.com/apache/datafusion/issues/11752">here</a>) in DataFusion.</p>
+
+<p>Here is a partial list of people involved in the project (I am sorry to those whom I forgot):</p>
+
+<ul>
+  <li><strong>Arrow</strong>:  <a 
href="https://github.com/XiangpengHao";>Xiangpeng Hao</a> (InfluxData’s amazing 
2024 summer intern and UW Madison PhD), <a 
href="https://github.com/ariesdevil";>Yijun Zhao</a> from DataBend Labs, and <a 
href="https://github.com/tustvold";>Raphael Taylor-Davies</a> laid the 
foundation.  <a href="https://github.com/RinChanNOWWW";>RinChanNOW</a> from 
Tencent and <a href="https://github.com/a10y";>Andrew Duffy</a> from SpiralDB 
helped push it along in the early d [...]
+  <li><strong>DataFusion</strong>:  <a 
href="https://github.com/XiangpengHao";>Xiangpeng Hao</a>, again charted the 
initial path and <a href="https://github.com/Weijun-H";>Weijun Huang</a>, <a 
href="https://github.com/dharanad";>Dharan Aditya</a> <a 
href="https://github.com/Lordworms";>Lordworms</a>, <a 
href="https://github.com/goldmedal";>Jax Liu</a>,  <a 
href="https://github.com/wiedld";>wiedld</a>, <a 
href="https://github.com/tlm365";>Tai Le Manh</a>, <a 
href="https://github.com/my-vegetable [...]
+  <li><strong>DataFusion String Function Migration</strong>:  <a 
href="https://github.com/tshauck";>Trent Hauck</a> organized the effort and set 
the patterns, <a href="https://github.com/goldmedal";>Jax Liu</a> made a clever 
testing framework, and <a href="https://github.com/austin362667";>Austin 
Liu</a>, <a href="https://github.com/demetribu";>Dmitrii Bu</a>, <a 
href="https://github.com/tlm365";>Tai Le Manh</a>, <a 
href="https://github.com/PsiACE";>Chojan Shang</a>, <a href="https://github.co 
[...]
+</ul>
+
+<h2 id="parquet">Parquet</h2>
+
+<p>Part of the reason for DataFusion’s speed in ClickBench is reading Parquet files (really) quickly,
+which reflects the effort invested in the Parquet reading system (see <a href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/">Querying
+Parquet with Millisecond Latency</a>).</p>
+
+<p>The <a href="https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html">DataFusion ParquetExec</a> (built on the <a href="https://crates.io/crates/parquet">Rust Parquet Implementation</a>) is now the most
+sophisticated open source Parquet reader I know of. It has every optimization we
+can think of for reading Parquet, including projection pushdown, predicate
+pushdown (row group metadata, page index, and bloom filters), limit pushdown,
+parallel reading, interleaved I/O, and late materialized filtering (coming soon ™️
+by default). Work from <a href="https://github.com/itsjunetime">June</a>
+<a href="https://github.com/apache/datafusion/pull/12135">recently unblocked a remaining hurdle</a> for enabling late materialized
+filtering, and conveniently <a href="https://github.com/XiangpengHao">Xiangpeng Hao</a> is
+working on the <a href="https://github.com/apache/arrow-datafusion/issues/3463">final piece</a> (no pressure😅).</p>
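+
+<p>As a rough illustration of predicate pushdown using row group metadata, the sketch below prunes row groups whose min/max statistics cannot match an equality predicate. It is a simplified model under stated assumptions, not the ParquetExec code: <code class="language-plaintext highlighter-rouge">RowGroupStats</code> and <code class="language-plaintext highlighter-rouge">row_groups_to_read</code> are hypothetical names, and only a single 64-bit integer column is considered.</p>
+
```rust
/// Min/max statistics for one column in one row group (hypothetical type).
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Returns the indices of row groups that may contain `column = target`.
/// Row groups whose [min, max] range excludes `target` are never read or decoded.
fn row_groups_to_read(stats: &[RowGroupStats], target: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= target && target <= s.max)
        .map(|(index, _)| index)
        .collect()
}
```
+
+<p>Statistics can only rule row groups out: a row group that survives this check may still contain no matching rows, which is where finer-grained structures such as the page index and bloom filters help.</p>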
+
+<h2 id="skipping-partial-aggregation-when-it-doesnt-help">Skipping Partial 
Aggregation When It Doesn’t Help</h2>
+
+<p>Many ClickBench queries are aggregations that summarize millions of rows, a
+common task for reporting and dashboarding. DataFusion uses state-of-the-art
+<a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state">two phase aggregation</a> plans. Normally, two phase aggregation works well as the
+first phase consolidates many rows immediately after reading, while the data is
+still in cache. However, for certain “high cardinality” aggregate queries (that
+have large numbers of groups), <a href="https://github.com/apache/datafusion/issues/6937">the two phase aggregation strategy used in
+DataFusion was inefficient</a>,
+resulting in slower performance than other engines for
+ClickBench queries such as:</p>
+
+<div class="language-sql highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="k">SELECT</span> <span 
class="nv">"WatchID"</span><span class="p">,</span> <span 
class="nv">"ClientIP"</span><span class="p">,</span> <span 
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span 
class="p">)</span> <span class="k">AS</span> <span class="k">c</span><span 
class="p">,</span> <span class="p">...</span> 
+<span class="k">FROM</span> <span class="n">hits</span> 
+<span class="k">GROUP</span> <span class="k">BY</span> <span 
class="nv">"WatchID"</span><span class="p">,</span> <span 
class="nv">"ClientIP"</span> <span class="cm">/* &lt;----- 13M Distinct 
Groups!!! */</span>
+<span class="k">ORDER</span> <span class="k">BY</span> <span 
class="k">c</span> <span class="k">DESC</span> 
+<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
+</code></pre></div></div>
+
+<p>For such queries, the first aggregation phase does not significantly
+reduce the number of rows, which wastes significant effort. <a 
href="https://github.com/korowa";>Eduard
+Karacharov</a> contributed a <a 
href="https://github.com/apache/datafusion/pull/11627";>dynamic strategy</a> to
+bypass the first phase when it is not working efficiently, shown in Figure 
4.</p>
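+
+<p>The general shape of such a dynamic strategy can be sketched as follows. This is a hedged illustration for a <code class="language-plaintext highlighter-rouge">COUNT(*)</code> with a single integer group key, not the code from the linked pull request: <code class="language-plaintext highlighter-rouge">PartialCount</code>, <code class="language-plaintext highlighter-rouge">probe_rows</code>, <code class="language-plaintext highlighter-rouge">max_ratio</code> and <code class="language-plaintext highlighter-rouge">final_count</code> are hypothetical names, and the sketch rechecks the ratio after every batch once the probe threshold is reached.</p>
+
```rust
use std::collections::HashMap;

/// First (partial) aggregation phase that can switch itself off.
struct PartialCount {
    groups: HashMap<u64, u64>,
    rows_seen: usize,
    /// Rows to observe before judging whether the phase is useful.
    probe_rows: usize,
    /// Skip the phase when groups / rows exceeds this ratio.
    max_ratio: f64,
    skipping: bool,
}

impl PartialCount {
    fn new(probe_rows: usize, max_ratio: f64) -> Self {
        Self { groups: HashMap::new(), rows_seen: 0, probe_rows, max_ratio, skipping: false }
    }

    /// Consumes one batch of group keys and returns any (key, count) pairs
    /// that should be forwarded to the final phase right away.
    fn push_batch(&mut self, keys: &[u64]) -> Vec<(u64, u64)> {
        if self.skipping {
            // Pass-through mode: no hashing, each row becomes a count of 1.
            return keys.iter().map(|&k| (k, 1)).collect();
        }
        for &k in keys {
            *self.groups.entry(k).or_insert(0) += 1;
        }
        self.rows_seen += keys.len();
        if self.rows_seen >= self.probe_rows {
            let ratio = self.groups.len() as f64 / self.rows_seen as f64;
            if ratio > self.max_ratio {
                // Almost every row is its own group: flush state and stop aggregating.
                self.skipping = true;
                return self.groups.drain().collect();
            }
        }
        Vec::new()
    }

    /// Emits whatever partial state is still buffered at end of input.
    fn finish(&mut self) -> Vec<(u64, u64)> {
        self.groups.drain().collect()
    }
}

/// Second (final) phase: merges partial counts per key.
fn final_count(partials: impl IntoIterator<Item = (u64, u64)>) -> HashMap<u64, u64> {
    let mut totals = HashMap::new();
    for (key, count) in partials {
        *totals.entry(key).or_insert(0) += count;
    }
    totals
}
```
+
+<p>The final phase produces the same totals either way; the only thing that changes is whether the first phase spends time maintaining a hash table that barely reduces the row count.</p>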
+
+<p><img 
src="/blog/img/clickbench-datafusion-43/skipping-partial-aggregation.png" 
width="100%" class="img-responsive" alt="Two phase aggregation diagram from 
DataFusion API docs annotated to show first phase not helping" /></p>
+
+<p><strong>Figure 4</strong>: Diagram from <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state";>DataFusion
 API docs</a> showing when the multi-phase
+grouping is not effective</p>
+
+<h2 id="optimized-multi-column-grouping">Optimized Multi-Column Grouping</h2>
+
+<p>Another method for improving analytic database performance is to provide specialized
+(highly optimized) versions of operations for different data types, which the
+system picks at runtime based on the query. Like other systems, DataFusion has
+specialized code for handling different types of group columns. For example,
+there is <a href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs">special code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP BY int_id</code> and <a href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/bytes.rs">different special
+code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP BY string_id</code>.</p>
+
+<p>When a query groups by multiple columns, it is trickier to apply this technique.
+For example, <code class="language-plaintext highlighter-rouge">GROUP BY string_id, int_id</code> and <code class="language-plaintext highlighter-rouge">GROUP BY int_id, string_id</code> have
+different optimal structures, but it is not possible to include specialized
+versions for all possible combinations of group column types.</p>
+
+<p>DataFusion includes <a 
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/row.rs#L33-L39";>a
 general Row based mechanism</a> that works for any
+combination of column types, but this general mechanism copies each value twice
+as shown in Figure 5. The cost of this copy <a 
href="https://github.com/apache/datafusion/issues/9403";>is especially high for 
variable
+length strings and binary data</a>.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/row-based-storage.png" 
width="100%" class="img-responsive" alt="Row based storage for multiple group 
columns" /></p>
+
+<p><strong>Figure 5</strong>: Prior to DataFusion 43.0.0, queries with 
multiple group columns
+used Row based group storage and copied each group value twice. This copy
+consumes a substantial amount of the query time for queries with many distinct
+groups, such as several of the queries in ClickBench.</p>
+
+<p>Many optimizations in databases boil down to simply avoiding copies, and this
+was no exception. The trick was to figure out how to avoid copies without
+causing per-column comparison overhead to dominate or complexity to get out of
+hand. In a great example of diligent and disciplined engineering, <a href="https://github.com/jayzhan211">Jay
+Zhan</a> tried <a href="https://github.com/apache/datafusion/pull/10937">several</a> <a href="https://github.com/apache/datafusion/pull/10976">different</a> approaches until arriving
+at the <a href="https://github.com/apache/datafusion/pull/12269">one shipped in DataFusion <code class="language-plaintext highlighter-rouge">43.0.0</code></a>, shown in Figure 6.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/column-based-storage.png" 
width="100%" class="img-responsive" alt="Column based storage for multiple 
group columns" /></p>
+
+<p><strong>Figure 6</strong>: DataFusion 43.0.0’s new columnar group storage 
copies each group
+value exactly once, which is significantly faster when grouping by multiple
+columns.</p>
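+
+<p>A minimal sketch of columnar group storage for <code class="language-plaintext highlighter-rouge">GROUP BY string_id, int_id</code> is shown below. It only illustrates the idea of appending each group value once to a per-column vector and comparing column by column; it is not the DataFusion 43.0.0 implementation, and <code class="language-plaintext highlighter-rouge">ColumnarGroups</code> and <code class="language-plaintext highlighter-rouge">intern</code> are hypothetical names.</p>
+
```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Group keys stored one vector per group column, instead of one encoded row per group.
#[derive(Default)]
struct ColumnarGroups {
    strings: Vec<String>, // values of the first group column
    ints: Vec<i64>,       // values of the second group column
    /// Hash of the whole key -> indices of groups with that hash.
    by_hash: HashMap<u64, Vec<usize>>,
}

impl ColumnarGroups {
    /// Returns the group index for `(s, i)`, creating the group if it is new.
    fn intern(&mut self, s: &str, i: i64) -> usize {
        let mut hasher = DefaultHasher::new();
        s.hash(&mut hasher);
        i.hash(&mut hasher);
        let candidates = self.by_hash.entry(hasher.finish()).or_default();

        // Compare against stored groups column by column; the cheap
        // integer comparison runs before the string comparison.
        for &group in candidates.iter() {
            if self.ints[group] == i && self.strings[group] == s {
                return group;
            }
        }

        // New group: each value is copied exactly once, into its own column.
        let group = self.ints.len();
        self.strings.push(s.to_string());
        self.ints.push(i);
        candidates.push(group);
        group
    }
}
```
+
+<p>A row based design would first encode <code class="language-plaintext highlighter-rouge">(s, i)</code> into a temporary row and then copy that row into the stored group keys, which is the second copy that this layout avoids.</p>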
+
+<p>Huge thanks as well to <a href="https://github.com/eejbyfeldt">Emil Ejbyfeldt</a> and
+<a href="https://github.com/Dandandan">Daniël Heres</a> for their help reviewing and to
+<a href="https://github.com/Rachelint">Rachelint (kamille)</a> for reviewing and
+contributing a faster <a href="https://github.com/apache/datafusion/pull/12996">vectorized append and compare for multiple groups</a>, which
+will be released in DataFusion 44. The discussion on <a href="https://github.com/apache/datafusion/issues/9403">the ticket</a> is another
+great example of the power of the DataFusion community working together to build
+great software.</p>
+
+<h1 id="whats-next-">What’s Next 🚀</h1>
+
+<p>Just as I expect the performance of other engines to improve, DataFusion has
+several more performance improvements lined up itself:</p>
+
+<ol>
+  <li><a href="https://github.com/apache/datafusion/pull/11943#top">Intermediate results blocked management</a> (thanks again <a href="https://github.com/Rachelint">Rachelint (kamille)</a>)</li>
+  <li><a href="https://github.com/apache/datafusion/issues/3463";>Enable 
parquet filter pushdown by default</a></li>
+</ol>
+
+<p>We are also talking about what to focus on over the <a 
href="https://github.com/apache/datafusion/issues/13274";>next three
+months</a> and are always
+looking for people to help! If you want to geek out (obsess??) about 
performance
+and other features with engineers from around the world, <a 
href="https://datafusion.apache.org/contributor-guide/communication.html";>we 
would love you to
+join us</a>.</p>
+
+<h1 id="additional-thanks">Additional Thanks</h1>
+
+<p>In addition to the people called out above, thanks:</p>
+
+<ol>
+  <li><a href="https://github.com/pmcgleenon";>Patrick McGleenon</a> for 
running ClickBench and gathering this data (<a 
href="https://github.com/apache/datafusion/issues/13099#issuecomment-2478314793";>source</a>).</li>
+  <li>Everyone I missed in the shoutouts – there are so many of you. We 
appreciate everyone.</li>
+</ol>
+
+<h1 id="conclusion">Conclusion</h1>
+
+<p>I have dreamed about DataFusion being on top of the ClickBench leaderboard 
for
+several years. I often watched with envy improvements in systems backed by 
large
+VC investments, internet companies, or world class research institutions, and
+doubted that we could pull off something similar in an open source project with
+always limited time.</p>
+
+<p>The fact that we have now surpassed those other systems in query 
performance I
+think speaks to the power and possibility of focusing on community and aligning
+our collective enthusiasm and skills towards a common goal. Of course, being on
+the top in any particular benchmark is likely fleeting as other engines will
+improve, but so will DataFusion!</p>
+
+<p>I love working on DataFusion – the people, the quality of the code, my
+interactions and the results we have achieved together far surpass my
+expectations as well as most of my other software development experiences. I
+can’t wait to see what people will build next, and hope to <a 
href="https://github.com/apache/datafusion";>see you
+online</a>.</p>
+
+<h2 id="notes">Notes</h2>
+
+<div class="footnotes" role="doc-endnotes">
+  <ol>
+    <li id="fn:1" role="doc-endnote">
+      <p>Note that DuckDB is slightly faster on the ‘cold’ run. <a 
href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:2" role="doc-endnote">
+      <p>Want to try your hand at a custom format for ClickBench fame / 
glory?: <a href="https://github.com/apache/datafusion/issues/13448";>Make 
DataFusion the fastest engine in ClickBench with custom file format</a> <a 
href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:3" role="doc-endnote">
+      <p>We have contributors from North America, South America, Europe, Asia, Africa and Australia <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:4" role="doc-endnote">
+      <p>Undergraduates, PhD students, junior engineers, and getting-kind-of-crotchety experienced engineers <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:5" role="doc-endnote">
+      <p>Thanks to Andy Pavlo, I love that nomenclature <a href="#fnref:5" 
class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+  </ol>
+</div>
+
+  </div><a class="u-url" 
href="/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/" 
hidden></a>
+</article>
+
+      </div>
+    </main><footer class="site-footer h-card">
+  <data class="u-url" href="/blog/"></data>
+
+  <div class="wrapper">
+
+    <h2 class="footer-heading">Apache DataFusion Project News &amp; Blog</h2>
+
+    <div class="footer-col-wrapper">
+      <div class="footer-col footer-col-1">
+        <ul class="contact-list">
+          <li class="p-name">Apache DataFusion Project News &amp; 
Blog</li><li><a class="u-email" 
href="mailto:[email protected]";>[email protected]</a></li></ul>
+      </div>
+
+      <div class="footer-col footer-col-2"><ul 
class="social-media-list"><li><a 
href="https://www.twitter.com/ApacheDataFusio";><svg class="svg-icon"><use 
xlink:href="/blog/assets/minima-social-icons.svg#twitter"></use></svg> <span 
class="username">ApacheDataFusio</span></a></li></ul>
+</div>
+
+      <div class="footer-col footer-col-3">
+        <p>Apache DataFusion is a very fast, extensible query engine for 
building high-quality  data-centric systems in Rust, using the Apache Arrow 
in-memory format.</p>
+      </div>
+    </div>
+
+  </div>
+
+</footer>
+</body>
+
+</html>
diff --git a/README.md b/README.md
index 3e34ab1..17f4395 100644
--- a/README.md
+++ b/README.md
@@ -72,13 +72,17 @@ git checkout asf-site
 git pull
 # create a branch for the publishing
 git checkout -b publish_blog
-# push code upstream
-git push 
 # copy content built from _site directory
 cp -R ../datafusion-site/_site/* .
 git commit -a -m 'Publish blog content'
+# push code upstream
+git push 
 ```
 
 #### Make PR, targeting the `asf-site` branch
 For example, see https://github.com/apache/datafusion-site/pull/9
 
+#### Check site status
+
+The website is updated from the `asf-site` branch. You can check the status at 
+[ASF Infra sitesource](https://infra-reports.apache.org/#sitesource)
diff --git a/assets/main.css.map b/assets/main.css.map
index 4da063c..3dde519 100644
--- a/assets/main.css.map
+++ b/assets/main.css.map
@@ -1 +1 @@
-{"version":3,"sourceRoot":"","sources":["../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima/_base.scss","../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima.scss","../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima/_layout.scss","../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima/_syntax-highlighting.scss"],"names":[],"mappings":"AAAA;AAAA;AAAA;AAGA;AAAA;AAAA;EAGE;EACA;;;AAKF;AAAA;AAAA;AAGA;EACE;EACA,OCLiB;EDMjB,kBCLiB;EDMjB;EACA;EACG;EACE;EACG;EACR;EACA;EA
 [...]
\ No newline at end of file
+{"version":3,"sourceRoot":"","sources":["../../usr/local/bundle/gems/minima-2.5.1/_sass/minima/_base.scss","../../usr/local/bundle/gems/minima-2.5.1/_sass/minima.scss","../../usr/local/bundle/gems/minima-2.5.1/_sass/minima/_layout.scss","../../usr/local/bundle/gems/minima-2.5.1/_sass/minima/_syntax-highlighting.scss"],"names":[],"mappings":"AAAA;AAAA;AAAA;AAGA;AAAA;AAAA;EAGE;EACA;;;AAKF;AAAA;AAAA;AAGA;EACE;EACA,OCLiB;EDMjB,kBCLiB;EDMjB;EACA;EACG;EACE;EACG;EACR;EACA;EACA;EACA;;;AAKF;AAAA;
 [...]
\ No newline at end of file
diff --git a/feed.xml b/feed.xml
index d947f25..1da1efd 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,4 @@
-<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="4.3.3">Jekyll</generator><link 
href="https://datafusion.apache.org/blog/feed.xml"; rel="self" 
type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/"; 
rel="alternate" type="text/html" 
/><updated>2024-11-20T21:23:41+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title
 type="html">Apache DataFusion Project News &amp;amp;  [...]
+<?xml version="1.0" encoding="utf-8"?><feed 
xmlns="http://www.w3.org/2005/Atom"; ><generator uri="https://jekyllrb.com/"; 
version="4.3.3">Jekyll</generator><link 
href="https://datafusion.apache.org/blog/feed.xml"; rel="self" 
type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/"; 
rel="alternate" type="text/html" 
/><updated>2024-11-21T18:18:11+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title
 type="html">Apache DataFusion Project News &amp;amp;  [...]
 
 -->
 
@@ -692,7 +692,260 @@ for their helpful reviews and feedback.</p>
 <p>Lastly, the Apache Arrow and DataFusion community is an active group of 
very helpful people working
 to make a great tool. If you want to get involved, please take a look at the
 <a href="https://datafusion.apache.org/python/";>online documentation</a> and 
jump in to help with one of the
-<a href="https://github.com/apache/datafusion-python/issues";>open 
issues</a>.</p>]]></content><author><name>timsaucer</name></author><category 
term="tutorial" /><summary 
type="html"><![CDATA[&lt;!–]]></summary></entry><entry><title 
type="html">Apache DataFusion Comet 0.3.0 Release</title><link 
href="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/"; 
rel="alternate" type="text/html" title="Apache DataFusion Comet 0.3.0 Release" 
/><published>2024-09-27T00:00:00+00:00</p [...]
+<a href="https://github.com/apache/datafusion-python/issues";>open 
issues</a>.</p>]]></content><author><name>timsaucer</name></author><category 
term="tutorial" /><summary 
type="html"><![CDATA[&lt;!–]]></summary></entry><entry><title 
type="html">Apache DataFusion is now the fastest single node engine for 
querying Apache Parquet files</title><link 
href="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/";
 rel="alternate" type="text/html" title="A [...]
+
+-->
+
+<p>I am extremely excited to announce that <a 
href="https://crates.io/crates/datafusion";>Apache DataFusion</a>  is the
+fastest engine for querying Apache Parquet files in <a 
href="https://benchmark.clickhouse.com/";>ClickBench</a>. It is faster
+than <a href="https://duckdb.org/";>DuckDB</a>, <a 
href="https://clickhouse.com/chdb";>chDB</a> and <a 
href="https://clickhouse.com/";>ClickHouse</a> using the same hardware. It also 
marks
+the first time a <a href="https://www.rust-lang.org/";>Rust</a>-based engine 
holds the top spot, which has previously
+been held by traditional C/C++-based engines.</p>
+
+<p><img src="/blog/img/2x_bgwhite_original.png" width="80%" 
class="img-responsive" alt="Apache DataFusion Logo" /></p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf.png" width="100%" 
class="img-responsive" alt="ClickBench performance for DataFusion 43.0.0" /></p>
+
+<p><strong>Figure 1</strong>: 2024-11-16 <a 
href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkFsbG95REIgKHR1bmVkKSI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsImNoREIgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6dHJ1ZSwiY2hEQiI6ZmFsc2UsIkNpdHVzIjpmYWxzZSwiQ2xpY2tIb3VzZSBDbG91ZCAoYXdzKSI6
 [...]
+partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a <code 
class="language-plaintext highlighter-rouge">c6a.4xlarge</code> (16
+CPU / 32 GB  RAM) VM. Measurements are relative (<code 
class="language-plaintext highlighter-rouge">1.x</code>) to results using
+different hardware.</p>
+
+<p>Best-in-class performance on Parquet is now available to anyone. 
DataFusion’s
+open design lets you start quickly with a full-featured query engine, including
+SQL, data formats, catalogs, and more, and then customize any behavior you 
need.
+I predict the continued emergence of new classes of data systems now that
+creators can focus the bulk of their innovation on areas such as query
+languages, system integrations, and data formats rather than trying to play
+catch-up with core engine performance.</p>
+
+<p>ClickBench also includes results for proprietary storage formats, which 
require
+costly load / export steps, making them useful in fewer use cases and thus much
+less important than open formats (though the idea of use case specific formats
+is interesting<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" 
class="footnote" rel="footnote">2</a></sup>).</p>
+
+<p>This blog post highlights some of the techniques we used to achieve this
+performance, and celebrates the teamwork involved.</p>
+
+<h1 id="a-strong-history-of-performance-improvements">A Strong History of 
Performance Improvements</h1>
+
+<p>Performance has long been a core focus for DataFusion’s community, and 
+speed attracts users and contributors. Recently, we seem to have been
+even more focused on performance, including in July 2024 when <a 
href="https://www.linkedin.com/in/mehmet-ozan-kabak/";>Mehmet Ozan
+Kabak</a>, CEO of <a href="https://www.synnada.ai/";>Synnada</a>, again <a 
href="https://github.com/apache/datafusion/issues/11442#issuecomment-2226834443";>suggested
 focusing on performance</a>. This
+got many of us excited (who doesn’t love a challenge?), and we have 
subsequently
+rallied to steadily improve performance release after release, as shown in
+Figure 2.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf-over-time.png" 
width="100%" class="img-responsive" alt="ClickBench performance results over 
time for DataFusion" /></p>
+
+<p><strong>Figure 2</strong>: ClickBench performance improved over 30% between 
DataFusion 34
+(released Dec. 2023) and DataFusion 43 (released Nov. 2024).</p>
+
+<p>Like all good optimization efforts, ours required sustained work, because 
DataFusion ran
+out of <a 
href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion";>single
 2x performance improvements</a> several years ago. Working together, our
+community of engineers from around the world<sup id="fnref:3" 
role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> 
and of all experience levels<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" 
class="footnote" rel="footnote">4</a></sup>
+pulled it off (check out <a 
href="https://github.com/apache/datafusion/issues/12821";>this discussion</a> to 
get a sense). It may be a “<a href="https://db.cs.cmu.edu/seminar2024/";>hobo
+sandwich</a>”<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" 
class="footnote" rel="footnote">5</a></sup>, but it is a tasty one!</p>
+
+<p>Of course, most of these techniques have been implemented and described 
before,
+but until now they were only available in proprietary systems such as
+<a href="https://www.vertica.com/";>Vertica</a>, <a 
href="https://www.databricks.com/product/photon";>Databricks
+Photon</a>, or
+<a href="https://www.snowflake.com/en/";>Snowflake</a>, or in tightly integrated 
open source
+systems such as <a href="https://duckdb.org/";>DuckDB</a> or
+<a href="https://clickhouse.com/";>ClickHouse</a> which were not designed to be 
extended.</p>
+
+<h2 id="stringview">StringView</h2>
+
+<p>Performance improved for all queries when DataFusion switched to using Arrow
+<code class="language-plaintext highlighter-rouge">StringView</code>. Using 
<code class="language-plaintext highlighter-rouge">StringView</code> “just” 
saves some copies and avoids one memory
+access for certain comparisons. However, these copies and comparisons happen to
+occur in many of the hottest loops during query processing, so optimizing them
+resulted in measurable performance improvements.</p>
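+
+<p>To make the saved copies and the skipped memory access concrete, here is a
+minimal Python sketch of the 16-byte view layout described in the Arrow
+columnar format: a 4-byte length, then either up to 12 inline bytes or a 4-byte
+prefix plus a buffer index and offset. The function names (<code 
class="language-plaintext highlighter-rouge">make_view</code>, <code 
class="language-plaintext highlighter-rouge">views_equal</code>, <code 
class="language-plaintext highlighter-rouge">take</code>) are hypothetical and 
this is an illustration of the idea, not the arrow-rs implementation.</p>
+

```python
import struct

INLINE_MAX = 12  # strings up to 12 bytes live entirely inside the 16-byte view


def make_view(value: bytes, buffer: bytearray, buffer_index: int = 0) -> bytes:
    """Encode one string as a 16-byte view, appending long strings to `buffer`."""
    length = len(value)
    if length <= INLINE_MAX:
        # Short string: 4-byte length followed by the bytes, zero padded.
        return struct.pack("<I12s", length, value)
    offset = len(buffer)
    buffer.extend(value)
    # Long string: length, 4-byte prefix, buffer index, offset into the buffer.
    return struct.pack("<I4sII", length, value[:4], buffer_index, offset)


def views_equal(a: bytes, b: bytes, buffer: bytearray) -> bool:
    """Compare two views, reading `buffer` only when the first 8 bytes match."""
    if a[:8] != b[:8]:
        # Length or 4-byte prefix differs: decided without touching string data.
        return False
    length = struct.unpack_from("<I", a)[0]
    if length <= INLINE_MAX:
        return a == b
    _, _, _, off_a = struct.unpack("<I4sII", a)
    _, _, _, off_b = struct.unpack("<I4sII", b)
    return buffer[off_a:off_a + length] == buffer[off_b:off_b + length]


def take(views: list, indices: list) -> list:
    """Select rows by copying fixed-size views; the string bytes are not copied."""
    return [views[i] for i in indices]
```

+<p>In this sketch <code class="language-plaintext 
highlighter-rouge">take</code> moves only the 16-byte views, and most unequal 
comparisons are decided from the length and prefix alone, which is the kind of 
saving that adds up inside hot loops.</p>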
+
+<p><img src="/blog/img/clickbench-datafusion-43/string-view-take.png" 
width="80%" class="img-responsive" alt="Illustration of how take works with 
StringView" /></p>
+
+<p><strong>Figure 3:</strong> Figure from <a 
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/";>Using
 StringView / German Style Strings to Make
+Queries Faster: Part 1</a> showing how <code class="language-plaintext 
highlighter-rouge">StringView</code> saves copying data in many cases.</p>
+
+<p>Using StringView to make DataFusion faster for ClickBench required 
substantial,
+careful, low-level optimization work described in <a 
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/";>Using
 StringView / German
+Style Strings to Make Queries Faster: Part 1</a> and <a 
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/";>Part
 2</a>. However, it <em>also</em>
+required extending the rest of DataFusion’s operations to support the new type.
+You can get a sense of the magnitude of the work required by looking at the 
100+
+pull requests linked to the epic in arrow-rs
+(<a href="https://github.com/apache/arrow-rs/issues/5374";>here</a>) and three 
major epics
+(<a href="https://github.com/apache/datafusion/issues/10918";>here</a>,
+<a href="https://github.com/apache/datafusion/issues/11790";>here</a> and
+<a href="https://github.com/apache/datafusion/issues/11752";>here</a>) in 
DataFusion.</p>
+
+<p>Here is a partial list of people involved in the project (I am sorry to 
those whom I forgot):</p>
+
+<ul>
+  <li><strong>Arrow</strong>:  <a 
href="https://github.com/XiangpengHao";>Xiangpeng Hao</a> (InfluxData’s amazing 
2024 summer intern and UW Madison PhD), <a 
href="https://github.com/ariesdevil";>Yijun Zhao</a> from DataBend Labs, and <a 
href="https://github.com/tustvold";>Raphael Taylor-Davies</a> laid the 
foundation.  <a href="https://github.com/RinChanNOWWW";>RinChanNOW</a> from 
Tencent and <a href="https://github.com/a10y";>Andrew Duffy</a> from SpiralDB 
helped push it along in the early d [...]
+  <li><strong>DataFusion</strong>:  <a 
href="https://github.com/XiangpengHao";>Xiangpeng Hao</a> again charted the 
initial path and <a href="https://github.com/Weijun-H";>Weijun Huang</a>, <a 
href="https://github.com/dharanad";>Dharan Aditya</a>, <a 
href="https://github.com/Lordworms";>Lordworms</a>, <a 
href="https://github.com/goldmedal";>Jax Liu</a>,  <a 
href="https://github.com/wiedld";>wiedld</a>, <a 
href="https://github.com/tlm365";>Tai Le Manh</a>, <a 
href="https://github.com/my-vegetable [...]
+  <li><strong>DataFusion String Function Migration</strong>:  <a 
href="https://github.com/tshauck";>Trent Hauck</a> organized the effort and set 
the patterns, <a href="https://github.com/goldmedal";>Jax Liu</a> made a clever 
testing framework, and <a href="https://github.com/austin362667";>Austin 
Liu</a>, <a href="https://github.com/demetribu";>Dmitrii Bu</a>, <a 
href="https://github.com/tlm365";>Tai Le Manh</a>, <a 
href="https://github.com/PsiACE";>Chojan Shang</a>, <a href="https://github.co 
[...]
+</ul>
+
+<h2 id="parquet">Parquet</h2>
+
+<p>Part of the reason for DataFusion’s speed in ClickBench is reading Parquet 
files (really) quickly,
+which reflects the effort invested in the Parquet reading system (see <a 
href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/";>Querying
+Parquet with Millisecond Latency</a>).</p>
+
+<p>The <a 
href="https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html";>DataFusion
 ParquetExec</a> (built on the <a href="https://crates.io/crates/parquet";>Rust 
Parquet Implementation</a>) is now the most
+sophisticated open source Parquet reader I know of. It has every optimization 
we
+can think of for reading Parquet, including projection pushdown, predicate
+pushdown (row group metadata, page index, and bloom filters), limit pushdown,
+parallel reading, interleaved I/O, and late materialized filtering (coming 
soon ™️
+by default). Recent work from <a 
href="https://github.com/itsjunetime";>June</a>
+<a href="https://github.com/apache/datafusion/pull/12135";>unblocked a 
remaining hurdle</a> for enabling late materialized
+filtering, and conveniently <a 
href="https://github.com/XiangpengHao";>Xiangpeng Hao</a> is
+working on the <a 
href="https://github.com/apache/arrow-datafusion/issues/3463";>final piece</a> 
(no pressure😅)</p>
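+
+<p>As one illustration of predicate pushdown using row group metadata, the
+following Python sketch skips row groups whose min/max statistics cannot
+satisfy a range predicate. <code class="language-plaintext 
highlighter-rouge">RowGroupStats</code> and <code class="language-plaintext 
highlighter-rouge">prune_row_groups</code> are hypothetical names chosen for 
this example; they are not part of the DataFusion or parquet crate APIs, and 
real readers also consult the page index and bloom filters.</p>
+

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics in Parquet metadata."""
    row_group: int
    min_value: int
    max_value: int


def prune_row_groups(stats, lower, upper):
    """Return ids of row groups that may contain values in [lower, upper]."""
    keep = []
    for s in stats:
        # A row group is skipped when its value range cannot overlap the
        # predicate range; otherwise it must still be read and filtered.
        if s.max_value >= lower and upper >= s.min_value:
            keep.append(s.row_group)
    return keep
```

+<p>Row groups that are pruned this way are never fetched or decoded, which is
+why statistics-based pruning pays off most on selective filters.</p>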
+
+<h2 id="skipping-partial-aggregation-when-it-doesnt-help">Skipping Partial 
Aggregation When It Doesn’t Help</h2>
+
+<p>Many ClickBench queries are aggregations that summarize millions of rows, a
+common task for reporting and dashboarding. DataFusion uses state of the art
+<a 
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state";>two
 phase aggregation</a> plans. Normally, two phase aggregation works well as the
+first phase consolidates many rows immediately after reading, while the data is
+still in cache. However, for certain “high cardinality” aggregate queries (that
+have large numbers of groups), <a 
href="https://github.com/apache/datafusion/issues/6937";>the two phase 
aggregation strategy used in
+DataFusion was inefficient</a>,
+manifesting in relatively slower performance compared to other engines for
+ClickBench queries such as</p>
+
+<div class="language-sql highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="k">SELECT</span> <span 
class="nv">"WatchID"</span><span class="p">,</span> <span 
class="nv">"ClientIP"</span><span class="p">,</span> <span 
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span 
class="p">)</span> <span class="k">AS</span> <span class="k">c</span><span 
class="p">,</span> <span class="p">...</span> 
+<span class="k">FROM</span> <span class="n">hits</span> 
+<span class="k">GROUP</span> <span class="k">BY</span> <span 
class="nv">"WatchID"</span><span class="p">,</span> <span 
class="nv">"ClientIP"</span> <span class="cm">/* &lt;----- 13M Distinct 
Groups!!! */</span>
+<span class="k">ORDER</span> <span class="k">BY</span> <span 
class="k">c</span> <span class="k">DESC</span> 
+<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
+</code></pre></div></div>
+
+<p>For such queries, the first aggregation phase does little to reduce the
+number of rows, so most of the work it does is wasted. <a 
href="https://github.com/korowa";>Eduard
+Karacharov</a> contributed a <a 
href="https://github.com/apache/datafusion/pull/11627";>dynamic strategy</a> to
+bypass the first phase when it is not working efficiently, shown in Figure 
4.</p>
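+
+<p>The idea can be sketched in a few lines of Python, assuming a simple
+<code class="language-plaintext highlighter-rouge">COUNT(*)</code> aggregate. 
The first phase probes a fixed number of rows and, if nearly every row created 
a new group, switches to passing rows through unaggregated. The function names 
and the <code class="language-plaintext highlighter-rouge">probe_rows</code> 
and <code class="language-plaintext highlighter-rouge">ratio_threshold</code> 
values are illustrative assumptions, not DataFusion’s actual code or 
defaults.</p>
+

```python
from collections import Counter


def partial_count(rows, probe_rows=4, ratio_threshold=0.8):
    """First-phase COUNT(*) that gives up when grouping is not reducing rows.

    Returns (partial_states, skipped), where partial_states is a list of
    (key, count) pairs handed to the final phase.
    """
    counts = Counter()
    passthrough = []
    skipped = False
    for seen, key in enumerate(rows, start=1):
        if skipped:
            # Pass-through mode: emit each row as its own partial state.
            passthrough.append((key, 1))
            continue
        counts[key] += 1
        if seen == probe_rows and len(counts) / seen >= ratio_threshold:
            # Nearly every probed row was a new group, so stop aggregating.
            skipped = True
    return list(counts.items()) + passthrough, skipped


def final_count(partial_states):
    """Second phase: merge the partial states into final counts."""
    totals = Counter()
    for key, count in partial_states:
        totals[key] += count
    return dict(totals)
```

+<p>The final phase produces the same answer either way; the only difference
+is how much work the first phase spends building a hash table that barely
+shrinks its input.</p>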
+
+<p><img 
src="/blog/img/clickbench-datafusion-43/skipping-partial-aggregation.png" 
width="100%" class="img-responsive" alt="Two phase aggregation diagram from 
DataFusion API docs annotated to show first phase not helping" /></p>
+
+<p><strong>Figure 4</strong>: Diagram from <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state";>DataFusion
 API docs</a> showing when the multi-phase
+grouping is not effective</p>
+
+<h2 id="optimized-multi-column-grouping">Optimized Multi-Column Grouping</h2>
+
+<p>Another method for improving analytic database performance is specialized 
(aka
+highly optimized) versions of operations for different data types, which the
+system picks at runtime based on the query. Like other systems, DataFusion has
+specialized code for handling different types of group columns. For example,
+there is <a 
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs";>special
 code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP 
BY int_id</code>  and <a 
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/bytes.rs";>different
 special
+code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP 
BY string_id</code>.</p>
+
+<p>When a query groups by multiple columns, it is trickier to apply this 
technique.
+For example <code class="language-plaintext highlighter-rouge">GROUP BY 
string_id, int_id</code> and <code class="language-plaintext 
highlighter-rouge">GROUP BY int_id, string_id</code> have
+different optimal structures, but it is not possible to include specialized
+versions for all possible combinations of group column types.</p>
+
+<p>DataFusion includes <a 
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/row.rs#L33-L39";>a
 general Row based mechanism</a> that works for any
+combination of column types, but this general mechanism copies each value twice
+as shown in Figure 5. The cost of this copy <a 
href="https://github.com/apache/datafusion/issues/9403";>is especially high for 
variable
+length strings and binary data</a>.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/row-based-storage.png" 
width="100%" class="img-responsive" alt="Row based storage for multiple group 
columns" /></p>
+
+<p><strong>Figure 5</strong>: Prior to DataFusion 43.0.0, queries with 
multiple group columns
+used Row based group storage and copied each group value twice. This copy
+consumes a substantial amount of the query time for queries with many distinct
+groups, such as several of the queries in ClickBench.</p>
+
+<p>Many optimizations in databases boil down to simply avoiding copies, and 
this
+was no exception. The trick was to figure out how to avoid copies without
+causing per-column comparison overhead to dominate or complexity to get out of
+hand. In a great example of diligent and disciplined engineering, <a 
href="https://github.com/jayzhan211";>Jay
+Zhan</a> tried <a 
href="https://github.com/apache/datafusion/pull/10937";>several</a> <a 
href="https://github.com/apache/datafusion/pull/10976";>different</a> approaches 
until arriving
+at the <a href="https://github.com/apache/datafusion/pull/12269";>one shipped 
in DataFusion <code class="language-plaintext 
highlighter-rouge">43.0.0</code></a>, shown in Figure 6.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/column-based-storage.png" 
width="100%" class="img-responsive" alt="Column based storage for multiple 
group columns" /></p>
+
+<p><strong>Figure 6</strong>: DataFusion 43.0.0’s new columnar group storage 
copies each group
+value exactly once, which is significantly faster when grouping by multiple
+columns.</p>
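+
+<p>A toy Python version of columnar group storage might look like the
+following. Each group column gets its own growable list, a hash table maps a
+key’s hash to candidate group indexes, and a new group’s values are appended
+exactly once. <code class="language-plaintext 
highlighter-rouge">ColumnarGroupValues</code> and <code 
class="language-plaintext highlighter-rouge">intern</code> are hypothetical 
names for this sketch, which omits the type-specialized builders and 
vectorization of the real implementation.</p>
+

```python
class ColumnarGroupValues:
    """Toy multi-column group store: one growable list per group column."""

    def __init__(self, num_columns):
        self.columns = [[] for _ in range(num_columns)]
        self.buckets = {}  # hash of the key -> list of candidate group indexes

    def intern(self, row):
        """Return the group index for the tuple `row`, appending it if new."""
        candidates = self.buckets.setdefault(hash(row), [])
        for group in candidates:
            # Compare column by column against the values already stored.
            if all(col[group] == value for col, value in zip(self.columns, row)):
                return group
        group = len(self.columns[0])
        for col, value in zip(self.columns, row):
            col.append(value)  # the single copy of each group value
        candidates.append(group)
        return group
```

+<p>Because the group values are already stored column by column, they can be
+emitted as output columns directly, without converting back from a row
+format.</p>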
+
+<p>Huge thanks as well to <a href="https://github.com/eejbyfeldt";>Emil 
Ejbyfeldt</a> and
+<a href="https://github.com/Dandandan";>Daniël Heres</a> for their help 
reviewing and to
+<a href="https://github.com/Rachelint";>Rachelint (kamille)</a> for reviewing 
and
+contributing a faster <a 
href="https://github.com/apache/datafusion/pull/12996";>vectorized append and 
compare for multiple groups</a> which
+will be released in DataFusion 44. The discussion on <a 
href="https://github.com/apache/datafusion/issues/9403";>the ticket</a> is 
another
+great example of the power of the DataFusion community working together to 
build
+great software.</p>
+
+<h1 id="whats-next-">What’s Next 🚀</h1>
+
+<p>Just as I expect the performance of other engines to improve, DataFusion
+itself has several more performance improvements lined up:</p>
+
+<ol>
+  <li><a 
href="https://github.com/apache/datafusion/pull/11943#top";>Intermediate results 
blocked management</a> (thanks again <a 
href="https://github.com/Rachelint";>Rachelint (kamille)</a>)</li>
+  <li><a href="https://github.com/apache/datafusion/issues/3463";>Enable 
parquet filter pushdown by default</a></li>
+</ol>
+
+<p>We are also talking about what to focus on over the <a 
href="https://github.com/apache/datafusion/issues/13274";>next three
+months</a> and are always
+looking for people to help! If you want to geek out (obsess??) about 
performance
+and other features with engineers from around the world, <a 
href="https://datafusion.apache.org/contributor-guide/communication.html";>we 
would love you to
+join us</a>.</p>
+
+<h1 id="additional-thanks">Additional Thanks</h1>
+
+<p>In addition to the people called out above, thanks:</p>
+
+<ol>
+  <li><a href="https://github.com/pmcgleenon";>Patrick McGleenon</a> for 
running ClickBench and gathering this data (<a 
href="https://github.com/apache/datafusion/issues/13099#issuecomment-2478314793";>source</a>).</li>
+  <li>Everyone I missed in the shoutouts – there are so many of you. We 
appreciate everyone.</li>
+</ol>
+
+<h1 id="conclusion">Conclusion</h1>
+
+<p>I have dreamed about DataFusion being on top of the ClickBench leaderboard 
for
+several years. I often watched with envy improvements in systems backed by 
large
+VC investments, internet companies, or world class research institutions, and
+doubted that we could pull off something similar in an open source project
+where time is always limited.</p>
+
+<p>The fact that we have now surpassed those other systems in query 
performance speaks,
+I think, to the power and possibility of focusing on community and aligning
+our collective enthusiasm and skills towards a common goal. Of course, being on
+the top in any particular benchmark is likely fleeting as other engines will
+improve, but so will DataFusion!</p>
+
+<p>I love working on DataFusion – the people, the quality of the code, my
+interactions and the results we have achieved together far surpass my
+expectations as well as most of my other software development experiences. I
+can’t wait to see what people will build next, and hope to <a 
href="https://github.com/apache/datafusion";>see you
+online</a>.</p>
+
+<h2 id="notes">Notes</h2>
+
+<div class="footnotes" role="doc-endnotes">
+  <ol>
+    <li id="fn:1" role="doc-endnote">
+      <p>Note that DuckDB is slightly faster on the ‘cold’ run. <a 
href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:2" role="doc-endnote">
+      <p>Want to try your hand at a custom format for ClickBench fame / 
glory?: <a href="https://github.com/apache/datafusion/issues/13448";>Make 
DataFusion the fastest engine in ClickBench with custom file format</a> <a 
href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:3" role="doc-endnote">
+      <p>We have contributors from North America, South America, Europe, 
Asia, Africa and Australia. <a href="#fnref:3" class="reversefootnote" 
role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:4" role="doc-endnote">
+      <p>Undergraduates, PhD students, junior engineers, and getting-kind-of-crotchety 
experienced engineers. <a href="#fnref:4" class="reversefootnote" 
role="doc-backlink">&#8617;</a></p>
+    </li>
+    <li id="fn:5" role="doc-endnote">
+      <p>Thanks to Andy Pavlo, I love that nomenclature <a href="#fnref:5" 
class="reversefootnote" role="doc-backlink">&#8617;</a></p>
+    </li>
+  </ol>
+</div>]]></content><author><name>Andrew Lamb, Staff Engineer at 
InfluxData</name></author><category term="core" /><category term="performance" 
/><summary type="html"><![CDATA[&lt;!–]]></summary></entry><entry><title 
type="html">Apache DataFusion Comet 0.3.0 Release</title><link 
href="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/"; 
rel="alternate" type="text/html" title="Apache DataFusion Comet 0.3.0 Release" 
/><published>2024-09-27T00:00:00+00:00</published><update [...]
 
 -->
 
@@ -1780,55 +2033,4 @@ recordings of previous calls.</p>
 performance regressions that you find. See the <a 
href="https://datafusion.apache.org/comet/user-guide/installation.html";>Getting 
Started</a> guide for instructions on downloading and installing
 Comet.</p>
 
-<p>There are also many <a 
href="https://github.com/apache/datafusion-comet/contribute";>good first 
issues</a> waiting for 
contributions.</p>]]></content><author><name>pmc</name></author><category 
term="subprojects" /><summary 
type="html"><![CDATA[&lt;!–]]></summary></entry><entry><title 
type="html">Announcing Apache Arrow DataFusion is now Apache 
DataFusion</title><link 
href="https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/"; 
rel="alternate" type="text/html" title="Announcing  [...]
-
--->
-
-<h2 id="introduction">Introduction</h2>
-
-<p>TLDR; <a href="https://arrow.apache.org/";>Apache Arrow</a> DataFusion –&gt; 
<a href="https://datafusion.apache.org/";>Apache DataFusion</a></p>
-
-<p>The Arrow PMC and newly created DataFusion PMC are happy to announce that 
as of
-April 16, 2024 the Apache Arrow DataFusion subproject is now a top level
-<a href="https://www.apache.org/";>Apache Software Foundation</a> project.</p>
-
-<h2 id="background">Background</h2>
-
-<p>Apache DataFusion is a fast, extensible query engine for building 
high-quality
-data-centric systems in Rust, using the Apache Arrow in-memory format.</p>
-
-<p>When DataFusion was <a 
href="https://arrow.apache.org/blog/2019/02/04/datafusion-donation/";>donated to 
the Apache Software Foundation</a> in 2019, the
-DataFusion community was not large enough to stand on its own and the Arrow
-project agreed to help support it. The community has grown significantly since
-2019, benefiting immensely from being part of Arrow and following <a 
href="https://www.apache.org/theapacheway/";>The Apache
-Way</a>.</p>
-
-<h2 id="why-now">Why now?</h2>
-
-<p>The community <a 
href="https://github.com/apache/datafusion/discussions/6475";>discussed 
graduating to a top level project publicly</a> for almost
-a year, as the project seemed ready to stand on its own and would benefit from
-more focused governance. For example, earlier in DataFusion’s life many
-contributed to both <a href="https://github.com/apache/arrow-rs";>arrow-rs</a> 
and DataFusion, but as DataFusion has matured many
-contributors, committers and PMC members focused more and more exclusively on
-DataFusion.</p>
-
-<h2 id="looking-forward">Looking forward</h2>
-
-<p>The future looks bright. There are now <a 
href="https://datafusion.apache.org/user-guide/introduction.html#known-users";>10s
 of known projects built with
-DataFusion</a>, and that number continues to grow. We recently held our <a 
href="https://github.com/apache/datafusion/discussions/8522";>first in
-person meetup</a> passed <a 
href="https://github.com/apache/datafusion/stargazers";>5000 stars</a> on 
GitHub, <a 
href="https://github.com/apache/datafusion/issues/8373#issuecomment-2025133714";>wrote
 a paper that was accepted
-at SIGMOD 2024</a>, and began work on <a 
href="https://github.com/apache/datafusion-comet";>Comet</a>, an <a 
href="https://spark.apache.org/";>Apache Spark</a> accelerator
-<a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/";>initially 
donated by Apple</a>.</p>
-
-<p>Thank you to everyone in the Arrow community who helped DataFusion grow and
-mature over the years, and we look forward to continuing our collaboration as
-projects. All future blogs and announcements will be posted on the <a 
href="https://datafusion.apache.org/";>Apache
-DataFusion</a> website.</p>
-
-<h2 id="get-involved">Get Involved</h2>
-
-<p>If you are interested in joining the community, we would love to have you 
join
-us. Get in touch using <a 
href="https://datafusion.apache.org/contributor-guide/communication.html";>Communication
 Doc</a> and learn how to get involved in the
-<a 
href="https://datafusion.apache.org/contributor-guide/index.html";>Contributor 
Guide</a>. We welcome everyone to try DataFusion on their
-own data and projects and let us know how it goes, contribute suggestions,
-documentation, bug reports, or a PR with documentation, tests or 
code.</p>]]></content><author><name>pmc</name></author><category 
term="subprojects" /><summary 
type="html"><![CDATA[&lt;!–]]></summary></entry></feed>
\ No newline at end of file
+<p>There are also many <a 
href="https://github.com/apache/datafusion-comet/contribute";>good first 
issues</a> waiting for 
contributions.</p>]]></content><author><name>pmc</name></author><category 
term="subprojects" /><summary 
type="html"><![CDATA[&lt;!–]]></summary></entry></feed>
\ No newline at end of file
diff --git a/img/clickbench-datafusion-43/column-based-storage.png 
b/img/clickbench-datafusion-43/column-based-storage.png
new file mode 100644
index 0000000..8130fcc
Binary files /dev/null and 
b/img/clickbench-datafusion-43/column-based-storage.png differ
diff --git a/img/clickbench-datafusion-43/perf-over-time.png 
b/img/clickbench-datafusion-43/perf-over-time.png
new file mode 100644
index 0000000..f4bf673
Binary files /dev/null and b/img/clickbench-datafusion-43/perf-over-time.png 
differ
diff --git a/img/clickbench-datafusion-43/perf.png 
b/img/clickbench-datafusion-43/perf.png
new file mode 100644
index 0000000..6e12e66
Binary files /dev/null and b/img/clickbench-datafusion-43/perf.png differ
diff --git a/img/clickbench-datafusion-43/row-based-storage.png 
b/img/clickbench-datafusion-43/row-based-storage.png
new file mode 100644
index 0000000..07fef67
Binary files /dev/null and b/img/clickbench-datafusion-43/row-based-storage.png 
differ
diff --git a/img/clickbench-datafusion-43/skipping-partial-aggregation.png 
b/img/clickbench-datafusion-43/skipping-partial-aggregation.png
new file mode 100644
index 0000000..411464d
Binary files /dev/null and 
b/img/clickbench-datafusion-43/skipping-partial-aggregation.png differ
diff --git a/img/clickbench-datafusion-43/string-view-take.png 
b/img/clickbench-datafusion-43/string-view-take.png
new file mode 100644
index 0000000..0cedd25
Binary files /dev/null and b/img/clickbench-datafusion-43/string-view-take.png 
differ
diff --git a/index.html b/index.html
index c136003..42665fa 100644
--- a/index.html
+++ b/index.html
@@ -48,6 +48,11 @@
           <a class="post-link" 
href="/blog/2024/11/19/datafusion-python-udf-comparisons/">
             Comparing approaches to User Defined Functions in Apache 
DataFusion using Python
           </a>
+        </h3></li><li><span class="post-meta">Nov 18, 2024</span>
+        <h3>
+          <a class="post-link" 
href="/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/">
+            Apache DataFusion is now the fastest single node engine for 
querying Apache Parquet files
+          </a>
         </h3></li><li><span class="post-meta">Sep 27, 2024</span>
         <h3>
           <a class="post-link" href="/blog/2024/09/27/datafusion-comet-0.3.0/">

