This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 7aafb14 Publish Apache DataFusion is now the fastest single node
engine for querying Apache Parquet files (#39)
7aafb14 is described below
commit 7aafb142744f5add2cd33d78b50d223277f9ab26
Author: Andrew Lamb <[email protected]>
AuthorDate: Thu Nov 21 13:20:20 2024 -0500
Publish Apache DataFusion is now the fastest single node engine for
querying Apache Parquet files (#39)
---
.../index.html | 336 +++++++++++++++++++++
README.md | 8 +-
assets/main.css.map | 2 +-
feed.xml | 310 +++++++++++++++----
.../column-based-storage.png | Bin 0 -> 47390 bytes
img/clickbench-datafusion-43/perf-over-time.png | Bin 0 -> 62185 bytes
img/clickbench-datafusion-43/perf.png | Bin 0 -> 57326 bytes
img/clickbench-datafusion-43/row-based-storage.png | Bin 0 -> 45024 bytes
.../skipping-partial-aggregation.png | Bin 0 -> 133362 bytes
img/clickbench-datafusion-43/string-view-take.png | Bin 0 -> 66566 bytes
index.html | 5 +
11 files changed, 604 insertions(+), 57 deletions(-)
diff --git
a/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/index.html
b/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/index.html
new file mode 100644
index 0000000..5340bd5
--- /dev/null
+++ b/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/index.html
@@ -0,0 +1,336 @@
+<!DOCTYPE html>
+<html lang="en"><head>
+ <meta charset="utf-8">
+ <meta http-equiv="X-UA-Compatible" content="IE=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1"><!--
Begin Jekyll SEO tag v2.8.0 -->
+<title>Apache DataFusion is now the fastest single node engine for querying
Apache Parquet files | Apache DataFusion Project News & Blog</title>
+<meta name="generator" content="Jekyll v4.3.3" />
+<meta property="og:title" content="Apache DataFusion is now the fastest single
node engine for querying Apache Parquet files" />
+<meta name="author" content="Andrew Lamb, Staff Engineer at InfluxData" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="<!–" />
+<meta property="og:description" content="<!–" />
+<link rel="canonical"
href="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"
/>
+<meta property="og:url"
content="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"
/>
+<meta property="og:site_name" content="Apache DataFusion Project News &
Blog" />
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2024-11-18T00:00:00+00:00" />
+<meta name="twitter:card" content="summary" />
+<meta property="twitter:title" content="Apache DataFusion is now the fastest
single node engine for querying Apache Parquet files" />
+<script type="application/ld+json">
+{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"Andrew
Lamb, Staff Engineer at
InfluxData"},"dateModified":"2024-11-18T00:00:00+00:00","datePublished":"2024-11-18T00:00:00+00:00","description":"<!–","headline":"Apache
DataFusion is now the fastest single node engine for querying Apache Parquet
files","mainEntityOfPage":{"@type":"WebPage","@id":"https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"},"
[...]
+<!-- End Jekyll SEO tag -->
+<link rel="stylesheet" href="/blog/assets/main.css"><link
type="application/atom+xml" rel="alternate"
href="https://datafusion.apache.org/blog/feed.xml" title="Apache DataFusion
Project News & Blog" /></head>
+<body><header class="site-header" role="banner">
+
+ <div class="wrapper"><a class="site-title" rel="author" href="/blog/">Apache
DataFusion Project News & Blog</a><nav class="site-nav">
+ <input type="checkbox" id="nav-trigger" class="nav-trigger" />
+ <label for="nav-trigger">
+ <span class="menu-icon">
+ <svg viewBox="0 0 18 15" width="18px" height="15px">
+ <path
d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0
h15.032C17.335,0,18,0.665,18,1.484L18,1.484z
M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0
c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z
M18,13.516C18,14.335,17.335,15,16.516,15H1.484
C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/>
+ </svg>
+ </span>
+ </label>
+
+ <div class="trigger"><a class="page-link"
href="/blog/about/">About</a></div>
+ </nav></div>
+</header>
+<main class="page-content" aria-label="Content">
+ <div class="wrapper">
+ <article class="post h-entry" itemscope
itemtype="http://schema.org/BlogPosting">
+
+ <header class="post-header">
+ <h1 class="post-title p-name" itemprop="name headline">Apache DataFusion
is now the fastest single node engine for querying Apache Parquet files</h1>
+ <p class="post-meta">
+ <time class="dt-published" datetime="2024-11-18T00:00:00+00:00"
itemprop="datePublished">Nov 18, 2024
+ </time>• <span itemprop="author" itemscope
itemtype="http://schema.org/Person"><span class="p-author h-card"
itemprop="name">Andrew Lamb, Staff Engineer at InfluxData</span></span></p>
+ </header>
+
+ <div class="post-content e-content" itemprop="articleBody">
+ <!--
+
+-->
+
+<p>I am extremely excited to announce that <a
href="https://crates.io/crates/datafusion">Apache DataFusion</a> is the
+fastest engine for querying Apache Parquet files in <a
href="https://benchmark.clickhouse.com/">ClickBench</a>. It is faster
+than <a href="https://duckdb.org/">DuckDB</a>, <a
href="https://clickhouse.com/chdb">chDB</a> and <a
href="https://clickhouse.com/">ClickHouse</a> using the same hardware. It also
marks
+the first time a <a href="https://www.rust-lang.org/">Rust</a>-based engine
holds the top spot, which has previously
+been held by traditional C/C++-based engines.</p>
+
+<p><img src="/blog/img/2x_bgwhite_original.png" width="80%"
class="img-responsive" alt="Apache DataFusion Logo" /></p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf.png" width="100%"
class="img-responsive" alt="ClickBench performance for DataFusion 43.0.0" /></p>
+
+<p><strong>Figure 1</strong>: 2024-11-16 <a
href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkFsbG95REIgKHR1bmVkKSI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsImNoREIgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6dHJ1ZSwiY2hEQiI6ZmFsc2UsIkNpdHVzIjpmYWxzZSwiQ2xpY2tIb3VzZSBDbG91ZCAoYXdzKSI6
[...]
+partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a <code
class="language-plaintext highlighter-rouge">c6a.4xlarge</code> (16
+CPU / 32 GB RAM) VM. Measurements are relative (<code
class="language-plaintext highlighter-rouge">1.x</code>) to results using
+different hardware.</p>
+
+<p>Best in class performance on Parquet is now available to anyone.
DataFusion’s
+open design lets you start quickly with a full featured Query Engine, including
+SQL, data formats, catalogs, and more, and then customize any behavior you
need.
+I predict the continued emergence of new classes of data systems now that
+creators can focus the bulk of their innovation on areas such as query
+languages, system integrations, and data formats rather than trying to play
+catchup with core engine performance.</p>
+
+<p>ClickBench also includes results for proprietary storage formats, which
require
+costly load / export steps, making them useful in fewer use cases and thus much
+less important than open formats (though the idea of use case specific formats
+is interesting<sup id="fnref:2" role="doc-noteref"><a href="#fn:2"
class="footnote" rel="footnote">2</a></sup>).</p>
+
+<p>This blog post highlights some of the techniques we used to achieve this
+performance, and celebrates the teamwork involved.</p>
+
+<h1 id="a-strong-history-of-performance-improvements">A Strong History of
Performance Improvements</h1>
+
+<p>Performance has long been a core focus for DataFusion’s community, and
+speed attracts users and contributors. Recently, we seem to have been
+even more focused on performance, including in July, 2024 when <a
href="https://www.linkedin.com/in/mehmet-ozan-kabak/">Mehmet Ozan
+Kabak</a>, CEO of <a href="https://www.synnada.ai/">Synnada</a>, again <a
href="https://github.com/apache/datafusion/issues/11442#issuecomment-2226834443">suggested
focusing on performance</a>. This
+got many of us excited (who doesn’t love a challenge!), and we have
subsequently
+rallied to steadily improve performance release after release, as shown in
+Figure 2.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf-over-time.png"
width="100%" class="img-responsive" alt="ClickBench performance results over
time for DataFusion" /></p>
+
+<p><strong>Figure 2</strong>: ClickBench performance improved over 30% between
DataFusion 34
+(released Dec. 2023) and DataFusion 43 (released Nov. 2024).</p>
+
+<p>Like all good optimization efforts, ours took sustained effort as
DataFusion ran
+out of <a
href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion">single
2x performance improvements</a> several years ago. Working together, our
+community of engineers from around the world<sup id="fnref:3"
role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>
and all experience levels<sup id="fnref:4" role="doc-noteref"><a href="#fn:4"
class="footnote" rel="footnote">4</a></sup>
+pulled it off (check out <a
href="https://github.com/apache/datafusion/issues/12821">this discussion</a> to
get a sense). It may be a “<a href="https://db.cs.cmu.edu/seminar2024/">hobo
+sandwich</a>” <sup id="fnref:5" role="doc-noteref"><a href="#fn:5"
class="footnote" rel="footnote">5</a></sup>, but it is a tasty one!</p>
+
+<p>Of course, most of these techniques have been implemented and described
before,
+but until now they were only available in proprietary systems such as
+<a href="https://www.vertica.com/">Vertica</a>, <a
href="https://www.databricks.com/product/photon">DataBricks
+Photon</a>, or
+<a href="https://www.snowflake.com/en/">Snowflake</a> or in tightly integrated
open source
+systems such as <a href="https://duckdb.org/">DuckDB</a> or
+<a href="https://clickhouse.com/">ClickHouse</a> which were not designed to be
extended.</p>
+
+<h2 id="stringview">StringView</h2>
+
+<p>Performance improved for all queries when DataFusion switched to using Arrow
+<code class="language-plaintext highlighter-rouge">StringView</code>. Using
<code class="language-plaintext highlighter-rouge">StringView</code> “just”
saves some copies and avoids one memory
+access for certain comparisons. However, these copies and comparisons happen to
+occur in many of the hottest loops during query processing, so optimizing them
+resulted in measurable performance improvements.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/string-view-take.png"
width="80%" class="img-responsive" alt="Illustration of how take works with
StringView" /></p>
+
+<p><strong>Figure 3:</strong> Figure from <a
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/">Using
StringView / German Style Strings to Make
+Queries Faster: Part 1</a> showing how <code class="language-plaintext
highlighter-rouge">StringView</code> saves copying data in many cases.</p>
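+
+<p>To make the idea concrete, here is a minimal Python sketch of the 16-byte
+view layout described in the Arrow format: a 4-byte length followed either by
+up to 12 bytes of inline data, or by a 4-byte prefix, a buffer index, and an
+offset. It is an illustration only, not the arrow-rs or DataFusion code, and
+the names <code class="language-plaintext highlighter-rouge">make_view</code>,
+<code class="language-plaintext highlighter-rouge">views_equal</code>, and
+<code class="language-plaintext highlighter-rouge">take</code> are
+hypothetical.</p>
+

```python
import struct

INLINE_MAX = 12  # strings of up to 12 bytes are stored inside the view itself


def make_view(value: bytes, buffer: bytearray) -> bytes:
    """Encode `value` as a 16-byte view; long strings are appended to `buffer`."""
    if len(value) <= INLINE_MAX:
        # 4-byte length, then the bytes themselves (zero padded to 12 bytes)
        return struct.pack("<I12s", len(value), value)
    offset = len(buffer)
    buffer.extend(value)
    # 4-byte length, 4-byte prefix, 4-byte buffer index (always 0 in this
    # single-buffer sketch), 4-byte offset into the buffer
    return struct.pack("<I4sII", len(value), value[:4], 0, offset)


def views_equal(a: bytes, b: bytes, buffer: bytearray) -> bool:
    """Compare two views, touching `buffer` only when length and prefix match."""
    if a[:8] != b[:8]:
        # Different length or different first four bytes: no buffer access
        return False
    (length,) = struct.unpack_from("<I", a)
    if length <= INLINE_MAX:
        return a[8:] == b[8:]
    off_a = struct.unpack("<I4sII", a)[3]
    off_b = struct.unpack("<I4sII", b)[3]
    return buffer[off_a:off_a + length] == buffer[off_b:off_b + length]


def take(views: list, indices: list) -> list:
    """Select rows by index: only the 16-byte views move, never the string data."""
    return [views[i] for i in indices]
```

+
+<p>In this sketch, most unequal strings are rejected by comparing the first
+eight bytes of each view, and
+<code class="language-plaintext highlighter-rouge">take</code> never copies
+string bytes. Those are the copies and memory accesses the paragraph above
+refers to.</p>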
+
+<p>Using StringView to make DataFusion faster for ClickBench required
substantial
+careful, low level optimization work described in <a
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/">Using
StringView / German
+Style Strings to Make Queries Faster: Part 1</a> and <a
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/">Part
2</a>. However, it <em>also</em>
+required extending the rest of DataFusion’s operations to support the new type.
+You can get a sense of the magnitude of the work required by looking at the
100+
+pull requests linked to the epic in arrow-rs
+(<a href="https://github.com/apache/arrow-rs/issues/5374">here</a>) and three
major epics
+(<a href="https://github.com/apache/datafusion/issues/10918">here</a>,
+<a href="https://github.com/apache/datafusion/issues/11790">here</a> and
+<a href="https://github.com/apache/datafusion/issues/11752">here</a>) in
DataFusion.</p>
+
+<p>Here is a partial list of people involved in the project (I am sorry to
those whom I forgot):</p>
+
+<ul>
+ <li><strong>Arrow</strong>: <a
href="https://github.com/XiangpengHao">Xiangpeng Hao</a> (InfluxData’s amazing
2024 summer intern and UW Madison PhD), <a
href="https://github.com/ariesdevil">Yijun Zhao</a> from DataBend Labs, and <a
href="https://github.com/tustvold">Raphael Taylor-Davies</a> laid the
foundation. <a href="https://github.com/RinChanNOWWW">RinChanNOW</a> from
Tencent and <a href="https://github.com/a10y">Andrew Duffy</a> from SpiralDB
helped push it along in the early d [...]
+ <li><strong>DataFusion</strong>: <a
href="https://github.com/XiangpengHao">Xiangpeng Hao</a>, again charted the
initial path and <a href="https://github.com/Weijun-H">Weijun Huang</a>, <a
href="https://github.com/dharanad">Dharan Aditya</a>, <a
href="https://github.com/Lordworms">Lordworms</a>, <a
href="https://github.com/goldmedal">Jax Liu</a>, <a
href="https://github.com/wiedld">wiedld</a>, <a
href="https://github.com/tlm365">Tai Le Manh</a>, <a
href="https://github.com/my-vegetable [...]
+ <li><strong>DataFusion String Function Migration</strong>: <a
href="https://github.com/tshauck">Trent Hauck</a> organized the effort and set
the patterns, <a href="https://github.com/goldmedal">Jax Liu</a> made a clever
testing framework, and <a href="https://github.com/austin362667">Austin
Liu</a>, <a href="https://github.com/demetribu">Dmitrii Bu</a>, <a
href="https://github.com/tlm365">Tai Le Manh</a>, <a
href="https://github.com/PsiACE">Chojan Shang</a>, <a href="https://github.co
[...]
+</ul>
+
+<h2 id="parquet">Parquet</h2>
+
+<p>Part of the reason for DataFusion’s speed in ClickBench is reading Parquet
files (really) quickly,
+which reflects invested effort in the Parquet reading system (see <a
href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/">Querying
+Parquet with Millisecond Latency</a>).</p>
+
+<p>The <a
href="https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html">DataFusion
ParquetExec</a> (built on the <a href="https://crates.io/crates/parquet">Rust
Parquet Implementation</a>) is now the most
+sophisticated open source Parquet reader I know of. It has every optimization
we
+can think of for reading Parquet, including projection pushdown, predicate
+pushdown (row group metadata, page index, and bloom filters), limit pushdown,
+parallel reading, interleaved I/O, and late materialized filtering (coming
soon ™️
+by default). Work from <a
href="https://github.com/itsjunetime">June</a>
+<a href="https://github.com/apache/datafusion/pull/12135">recently unblocked a
remaining hurdle</a> for enabling late materialized
+filtering, and conveniently <a
href="https://github.com/XiangpengHao">Xiangpeng Hao</a> is
+working on the <a
href="https://github.com/apache/arrow-datafusion/issues/3463">final piece</a>
(no pressure😅)</p>
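+
+<p>As a rough illustration of one item in that list, the Python sketch below
+shows predicate pushdown using row group metadata: row groups whose min/max
+statistics cannot satisfy the predicate are never read. It is a simplified
+model, not the ParquetExec implementation, and
+<code class="language-plaintext highlighter-rouge">RowGroupStats</code> and
+<code class="language-plaintext highlighter-rouge">prune_row_groups</code> are
+hypothetical names.</p>
+

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Per-row-group statistics for one column (hypothetical, simplified)."""
    min_value: int
    max_value: int
    num_rows: int


def prune_row_groups(stats, lower, upper):
    """Return indexes of row groups that may hold values in [lower, upper].

    A row group is skipped only when its [min, max] range cannot overlap the
    predicate range, so pruning never drops matching rows.
    """
    return [
        i
        for i, s in enumerate(stats)
        if s.max_value >= lower and s.min_value <= upper
    ]


# Example: a predicate such as `x BETWEEN 150 AND 220` over three row groups
example_stats = [
    RowGroupStats(min_value=0, max_value=99, num_rows=1000),
    RowGroupStats(min_value=100, max_value=199, num_rows=1000),
    RowGroupStats(min_value=200, max_value=299, num_rows=1000),
]
groups_to_read = prune_row_groups(example_stats, 150, 220)
```

+
+<p>The page index and bloom filters apply the same principle at finer
+granularity: decide from metadata alone which data cannot match, and skip
+it.</p>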
+
+<h2 id="skipping-partial-aggregation-when-it-doesnt-help">Skipping Partial
Aggregation When It Doesn’t Help</h2>
+
+<p>Many ClickBench queries are aggregations that summarize millions of rows, a
+common task for reporting and dashboarding. DataFusion uses state of the art
+<a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state">two
phase aggregation</a> plans. Normally, two phase aggregation works well as the
+first phase consolidates many rows immediately after reading, while the data is
+still in cache. However, for certain “high cardinality” aggregate queries (that
+have large numbers of groups), <a
href="https://github.com/apache/datafusion/issues/6937">the two phase
aggregation strategy used in
+DataFusion was inefficient</a>,
+manifesting in relatively slower performance compared to other engines for
+ClickBench queries such as</p>
+
+<div class="language-sql highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="k">SELECT</span> <span
class="nv">"WatchID"</span><span class="p">,</span> <span
class="nv">"ClientIP"</span><span class="p">,</span> <span
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span
class="p">)</span> <span class="k">AS</span> <span class="k">c</span><span
class="p">,</span> <span class="p">...</span>
+<span class="k">FROM</span> <span class="n">hits</span>
+<span class="k">GROUP</span> <span class="k">BY</span> <span
class="nv">"WatchID"</span><span class="p">,</span> <span
class="nv">"ClientIP"</span> <span class="cm">/* <----- 13M Distinct
Groups!!! */</span>
+<span class="k">ORDER</span> <span class="k">BY</span> <span
class="k">c</span> <span class="k">DESC</span>
+<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
+</code></pre></div></div>
+
+<p>For such queries, the first aggregation phase does not significantly
+reduce the number of rows, which wastes significant effort. <a
href="https://github.com/korowa">Eduard
+Karacharov</a> contributed a <a
href="https://github.com/apache/datafusion/pull/11627">dynamic strategy</a> to
+bypass the first phase when it is not working efficiently, shown in Figure
4.</p>
+
+<p><img
src="/blog/img/clickbench-datafusion-43/skipping-partial-aggregation.png"
width="100%" class="img-responsive" alt="Two phase aggregation diagram from
DataFusion API docs annotated to show first phase not helping" /></p>
+
+<p><strong>Figure 4</strong>: Diagram from <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state">DataFusion
API docs</a> showing when the multi-phase
+grouping is not effective</p>
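+
+<p>The general shape of such a strategy can be sketched in Python for a
+<code class="language-plaintext highlighter-rouge">COUNT(*)</code> query. This
+is an assumption-laden illustration rather than the code from the pull
+request: the function names and the tiny
+<code class="language-plaintext highlighter-rouge">probe_rows</code> and
+<code class="language-plaintext highlighter-rouge">ratio_threshold</code>
+values are hypothetical and chosen only to keep the example small.</p>
+

```python
from collections import Counter


def partial_count(batches, probe_rows=4, ratio_threshold=0.8):
    """First phase of COUNT(*) ... GROUP BY key, yielding (key, count) pairs.

    After `probe_rows` input rows, if the ratio of distinct groups to input
    rows is at least `ratio_threshold`, grouping is not reducing the data, so
    the remaining rows are passed through as single-row partial states.
    """
    groups = Counter()
    rows_seen = 0
    skipping = False
    for batch in batches:
        if skipping:
            # Bypass the hash table entirely: each row is its own partial state
            for key in batch:
                yield key, 1
            continue
        groups.update(batch)
        rows_seen += len(batch)
        if rows_seen >= probe_rows and len(groups) / rows_seen >= ratio_threshold:
            skipping = True
            # Flush what was accumulated so far, then stop aggregating
            yield from groups.items()
            groups.clear()
    yield from groups.items()


def final_count(partials):
    """Second phase: merge partial counts into the final result."""
    totals = Counter()
    for key, count in partials:
        totals[key] += count
    return dict(totals)
```

+
+<p>The final phase is unchanged and produces the same answer either way. Only
+the wasted hashing in the first phase is avoided when nearly every row is its
+own group.</p>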
+
+<h2 id="optimized-multi-column-grouping">Optimized Multi-Column Grouping</h2>
+
+<p>Another method for improving analytic database performance is specialized
(aka
+highly optimized) versions of operations for different data types, which the
+system picks at runtime based on the query. Like other systems, DataFusion has
+specialized code for handling different types of group columns. For example,
+there is <a
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs">special
code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP
BY int_id</code> and <a
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/bytes.rs">different
special
+code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP
BY string_id</code>.</p>
+
+<p>When a query groups by multiple columns, it is trickier to apply this
technique.
+For example, <code class="language-plaintext highlighter-rouge">GROUP BY
string_id, int_id</code> and <code class="language-plaintext
highlighter-rouge">GROUP BY int_id, string_id</code> have
+different optimal structures, but it is not possible to include specialized
+versions for all possible combinations of group column types.</p>
+
+<p>DataFusion includes <a
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/row.rs#L33-L39">a
general Row based mechanism</a> that works for any
+combination of column types, but this general mechanism copies each value twice
+as shown in Figure 5. The cost of this copy <a
href="https://github.com/apache/datafusion/issues/9403">is especially high for
variable
+length strings and binary data</a>.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/row-based-storage.png"
width="100%" class="img-responsive" alt="Row based storage for multiple group
columns" /></p>
+
+<p><strong>Figure 5</strong>: Prior to DataFusion 43.0.0, queries with
multiple group columns
+used Row based group storage and copied each group value twice. This copy
+consumes a substantial amount of the query time for queries with many distinct
+groups, such as several of the queries in ClickBench.</p>
+
+<p>Many optimizations in databases boil down to simply avoiding copies, and
this
+was no exception. The trick was to figure out how to avoid copies without
+causing per-column comparison overhead to dominate or complexity to get out of
+hand. In a great example of diligent and disciplined engineering, <a
href="https://github.com/jayzhan211">Jay
+Zhan</a> tried <a
href="https://github.com/apache/datafusion/pull/10937">several</a>, <a
href="https://github.com/apache/datafusion/pull/10976">different</a> approaches
until arriving
+at the <a href="https://github.com/apache/datafusion/pull/12269">one shipped
in DataFusion <code class="language-plaintext
highlighter-rouge">43.0.0</code></a>, shown in Figure 6.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/column-based-storage.png"
width="100%" class="img-responsive" alt="Column based storage for multiple
group columns" /></p>
+
+<p><strong>Figure 6</strong>: DataFusion 43.0.0’s new columnar group storage
copies each group
+value exactly once, which is significantly faster when grouping by multiple
+columns.</p>
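+
+<p>A minimal Python sketch of columnar group storage follows. It is a toy
+model of the idea in Figure 6, not the DataFusion implementation, and
+<code class="language-plaintext highlighter-rouge">ColumnarGroupValues</code>
+and <code class="language-plaintext highlighter-rouge">intern</code> are
+hypothetical names. Each new group's values are appended once to one list per
+group column, and lookups compare column by column against that storage.</p>
+

```python
class ColumnarGroupValues:
    """Stores distinct group keys as one list per group column."""

    def __init__(self, num_columns):
        self.columns = [[] for _ in range(num_columns)]
        # hash of the key -> indexes of groups whose key has that hash
        self.table = {}

    def _matches(self, group_index, row):
        # Per-column comparison against the stored group values
        return all(col[group_index] == v for col, v in zip(self.columns, row))

    def intern(self, row):
        """Return the group index for `row`, creating a new group if needed."""
        # The temporary tuple is used only for hashing in this sketch
        candidates = self.table.setdefault(hash(tuple(row)), [])
        for group_index in candidates:
            if self._matches(group_index, row):
                return group_index
        # New group: each value is copied exactly once, into its own column
        group_index = len(self.columns[0])
        for col, v in zip(self.columns, row):
            col.append(v)
        candidates.append(group_index)
        return group_index
```

+
+<p>Calling <code class="language-plaintext highlighter-rouge">intern</code>
+on each input row yields the group index that the aggregate accumulators are
+then updated with. The per-column comparison in
+<code class="language-plaintext highlighter-rouge">_matches</code> is the
+overhead that must be kept from dominating.</p>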
+
+<p>Huge thanks as well to <a href="https://github.com/eejbyfeldt">Emil
Ejbyfeldt</a> and
+<a href="https://github.com/Dandandan">Daniël Heres</a> for their help
reviewing and to
+<a href="https://github.com/Rachelint">Rachelint (kamille)</a> for reviewing
and
+contributing a faster <a
href="https://github.com/apache/datafusion/pull/12996">vectorized append and
compare for multiple groups</a> which
+will be released in DataFusion 44. The discussion on <a
href="https://github.com/apache/datafusion/issues/9403">the ticket</a> is
another
+great example of the power of the DataFusion community working together to
build
+great software.</p>
+
+<h1 id="whats-next-">What’s Next 🚀</h1>
+
+<p>Just as I expect the performance of other engines to improve, DataFusion has
+several more performance improvements lined up itself:</p>
+
+<ol>
+ <li><a
href="https://github.com/apache/datafusion/pull/11943#top">Intermediate results
blocked management</a> (thanks again <a
href="https://github.com/Rachelint">Rachelint (kamille)</a>)</li>
+ <li><a href="https://github.com/apache/datafusion/issues/3463">Enable
parquet filter pushdown by default</a></li>
+</ol>
+
+<p>We are also talking about what to focus on over the <a
href="https://github.com/apache/datafusion/issues/13274">next three
+months</a> and are always
+looking for people to help! If you want to geek out (obsess??) about
performance
+and other features with engineers from around the world, <a
href="https://datafusion.apache.org/contributor-guide/communication.html">we
would love you to
+join us</a>.</p>
+
+<h1 id="additional-thanks">Additional Thanks</h1>
+
+<p>In addition to the people called out above, thanks:</p>
+
+<ol>
+ <li><a href="https://github.com/pmcgleenon">Patrick McGleenon</a> for
running ClickBench and gathering this data (<a
href="https://github.com/apache/datafusion/issues/13099#issuecomment-2478314793">source</a>).</li>
+ <li>Everyone I missed in the shoutouts – there are so many of you. We
appreciate everyone.</li>
+</ol>
+
+<h1 id="conclusion">Conclusion</h1>
+
+<p>I have dreamed about DataFusion being on top of the ClickBench leaderboard
for
+several years. I often watched with envy improvements in systems backed by
large
+VC investments, internet companies, or world class research institutions, and
+doubted that we could pull off something similar in an open source project with
+always limited time.</p>
+
+<p>The fact that we have now surpassed those other systems in query
performance, I
+think, speaks to the power and possibility of focusing on community and aligning
+our collective enthusiasm and skills towards a common goal. Of course, being on
+the top in any particular benchmark is likely fleeting as other engines will
+improve, but so will DataFusion!</p>
+
+<p>I love working on DataFusion – the people, the quality of the code, my
+interactions and the results we have achieved together far surpass my
+expectations as well as most of my other software development experiences. I
+can’t wait to see what people will build next, and hope to <a
href="https://github.com/apache/datafusion">see you
+online</a>.</p>
+
+<h2 id="notes">Notes</h2>
+
+<div class="footnotes" role="doc-endnotes">
+ <ol>
+ <li id="fn:1" role="doc-endnote">
+ <p>Note that DuckDB is slightly faster on the ‘cold’ run. <a
href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:2" role="doc-endnote">
+ <p>Want to try your hand at a custom format for ClickBench fame /
glory?: <a href="https://github.com/apache/datafusion/issues/13448">Make
DataFusion the fastest engine in ClickBench with custom file format</a> <a
href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:3" role="doc-endnote">
+ <p>We have contributors from North America, South America, Europe,
Asia, Africa and Australia <a href="#fnref:3" class="reversefootnote"
role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:4" role="doc-endnote">
+ <p>Undergraduates, PhD students, junior engineers, and getting-kind-of-crotchety
experienced engineers <a href="#fnref:4" class="reversefootnote"
role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:5" role="doc-endnote">
+ <p>Thanks to Andy Pavlo, I love that nomenclature <a href="#fnref:5"
class="reversefootnote" role="doc-backlink">↩</a></p>
+ </li>
+ </ol>
+</div>
+
+ </div><a class="u-url"
href="/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"
hidden></a>
+</article>
+
+ </div>
+ </main><footer class="site-footer h-card">
+ <data class="u-url" href="/blog/"></data>
+
+ <div class="wrapper">
+
+ <h2 class="footer-heading">Apache DataFusion Project News & Blog</h2>
+
+ <div class="footer-col-wrapper">
+ <div class="footer-col footer-col-1">
+ <ul class="contact-list">
+ <li class="p-name">Apache DataFusion Project News &
Blog</li><li><a class="u-email"
href="mailto:[email protected]">[email protected]</a></li></ul>
+ </div>
+
+ <div class="footer-col footer-col-2"><ul
class="social-media-list"><li><a
href="https://www.twitter.com/ApacheDataFusio"><svg class="svg-icon"><use
xlink:href="/blog/assets/minima-social-icons.svg#twitter"></use></svg> <span
class="username">ApacheDataFusio</span></a></li></ul>
+</div>
+
+ <div class="footer-col footer-col-3">
+ <p>Apache DataFusion is a very fast, extensible query engine for
building high-quality data-centric systems in Rust, using the Apache Arrow
in-memory format.</p>
+ </div>
+ </div>
+
+ </div>
+
+</footer>
+</body>
+
+</html>
diff --git a/README.md b/README.md
index 3e34ab1..17f4395 100644
--- a/README.md
+++ b/README.md
@@ -72,13 +72,17 @@ git checkout asf-site
git pull
# create a branch for the publishing
git checkout -b publish_blog
-# push code upstream
-git push
# copy content built from _site directory
cp -R ../datafusion-site/_site/* .
git commit -a -m 'Publish blog content'
+# push code upstream
+git push
```
#### Make PR, targeting the `asf-site` branch
For example, see https://github.com/apache/datafusion-site/pull/9
+#### Check site status
+
+The website is updated from the `asf-site` branch. You can check the status at
+[ASF Infra sitesource](https://infra-reports.apache.org/#sitesource)
diff --git a/assets/main.css.map b/assets/main.css.map
index 4da063c..3dde519 100644
--- a/assets/main.css.map
+++ b/assets/main.css.map
@@ -1 +1 @@
-{"version":3,"sourceRoot":"","sources":["../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima/_base.scss","../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima.scss","../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima/_layout.scss","../../../../.gem/ruby/3.1.3/gems/minima-2.5.1/_sass/minima/_syntax-highlighting.scss"],"names":[],"mappings":"AAAA;AAAA;AAAA;AAGA;AAAA;AAAA;EAGE;EACA;;;AAKF;AAAA;AAAA;AAGA;EACE;EACA,OCLiB;EDMjB,kBCLiB;EDMjB;EACA;EACG;EACE;EACG;EACR;EACA;EA
[...]
\ No newline at end of file
+{"version":3,"sourceRoot":"","sources":["../../usr/local/bundle/gems/minima-2.5.1/_sass/minima/_base.scss","../../usr/local/bundle/gems/minima-2.5.1/_sass/minima.scss","../../usr/local/bundle/gems/minima-2.5.1/_sass/minima/_layout.scss","../../usr/local/bundle/gems/minima-2.5.1/_sass/minima/_syntax-highlighting.scss"],"names":[],"mappings":"AAAA;AAAA;AAAA;AAGA;AAAA;AAAA;EAGE;EACA;;;AAKF;AAAA;AAAA;AAGA;EACE;EACA,OCLiB;EDMjB,kBCLiB;EDMjB;EACA;EACG;EACE;EACG;EACR;EACA;EACA;EACA;;;AAKF;AAAA;
[...]
\ No newline at end of file
diff --git a/feed.xml b/feed.xml
index d947f25..1da1efd 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,4 @@
-<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.3.3">Jekyll</generator><link
href="https://datafusion.apache.org/blog/feed.xml" rel="self"
type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/"
rel="alternate" type="text/html"
/><updated>2024-11-20T21:23:41+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title
type="html">Apache DataFusion Project News &amp; [...]
+<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.3.3">Jekyll</generator><link
href="https://datafusion.apache.org/blog/feed.xml" rel="self"
type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/"
rel="alternate" type="text/html"
/><updated>2024-11-21T18:18:11+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title
type="html">Apache DataFusion Project News &amp; [...]
-->
@@ -692,7 +692,260 @@ for their helpful reviews and feedback.</p>
<p>Lastly, the Apache Arrow and DataFusion community is an active group of
very helpful people working
to make a great tool. If you want to get involved, please take a look at the
<a href="https://datafusion.apache.org/python/">online documentation</a> and
jump in to help with one of the
-<a href="https://github.com/apache/datafusion-python/issues">open
issues</a>.</p>]]></content><author><name>timsaucer</name></author><category
term="tutorial" /><summary
type="html"><![CDATA[<!–]]></summary></entry><entry><title
type="html">Apache DataFusion Comet 0.3.0 Release</title><link
href="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/"
rel="alternate" type="text/html" title="Apache DataFusion Comet 0.3.0 Release"
/><published>2024-09-27T00:00:00+00:00</p [...]
+<a href="https://github.com/apache/datafusion-python/issues">open
issues</a>.</p>]]></content><author><name>timsaucer</name></author><category
term="tutorial" /><summary
type="html"><![CDATA[<!–]]></summary></entry><entry><title
type="html">Apache DataFusion is now the fastest single node engine for
querying Apache Parquet files</title><link
href="https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/"
rel="alternate" type="text/html" title="A [...]
+
+-->
+
+<p>I am extremely excited to announce that <a
href="https://crates.io/crates/datafusion">Apache DataFusion</a> is the
+fastest engine for querying Apache Parquet files in <a
href="https://benchmark.clickhouse.com/">ClickBench</a>. It is faster
+than <a href="https://duckdb.org/">DuckDB</a>, <a
href="https://clickhouse.com/chdb">chDB</a> and <a
href="https://clickhouse.com/">ClickHouse</a> using the same hardware. It also
marks
+the first time a <a href="https://www.rust-lang.org/">Rust</a>-based engine
holds the top spot, which has previously
+been held by traditional C/C++-based engines.</p>
+
+<p><img src="/blog/img/2x_bgwhite_original.png" width="80%"
class="img-responsive" alt="Apache DataFusion Logo" /></p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf.png" width="100%"
class="img-responsive" alt="ClickBench performance for DataFusion 43.0.0" /></p>
+
+<p><strong>Figure 1</strong>: 2024-11-16 <a
href="https://benchmark.clickhouse.com/#eyJzeXN0ZW0iOnsiQWxsb3lEQiI6ZmFsc2UsIkFsbG95REIgKHR1bmVkKSI6ZmFsc2UsIkF0aGVuYSAocGFydGl0aW9uZWQpIjpmYWxzZSwiQXRoZW5hIChzaW5nbGUpIjpmYWxzZSwiQXVyb3JhIGZvciBNeVNRTCI6ZmFsc2UsIkF1cm9yYSBmb3IgUG9zdGdyZVNRTCI6ZmFsc2UsIkJ5Q29uaXR5IjpmYWxzZSwiQnl0ZUhvdXNlIjpmYWxzZSwiY2hEQiAoRGF0YUZyYW1lKSI6ZmFsc2UsImNoREIgKFBhcnF1ZXQsIHBhcnRpdGlvbmVkKSI6dHJ1ZSwiY2hEQiI6ZmFsc2UsIkNpdHVzIjpmYWxzZSwiQ2xpY2tIb3VzZSBDbG91ZCAoYXdzKSI6
[...]
+partitioned 14 GB Parquet dataset (100 files, each ~140MB) on a <code
class="language-plaintext highlighter-rouge">c6a.4xlarge</code> (16
+CPU / 32 GB RAM) VM. Measurements are relative (<code
class="language-plaintext highlighter-rouge">1.x</code>) to results using
+different hardware.</p>
+
+<p>Best-in-class performance on Parquet is now available to anyone.
DataFusion’s
+open design lets you start quickly with a full-featured query engine, including
+SQL, data formats, catalogs, and more, and then customize any behavior you
need.
+I predict the continued emergence of new classes of data systems now that
+creators can focus the bulk of their innovation on areas such as query
+languages, system integrations, and data formats rather than trying to play
+catch-up with core engine performance.</p>
+
+<p>ClickBench also includes results for proprietary storage formats, which
require
+costly load / export steps, making them useful in fewer use cases and thus much
+less important than open formats (though the idea of use case specific formats
+is interesting<sup id="fnref:2" role="doc-noteref"><a href="#fn:2"
class="footnote" rel="footnote">2</a></sup>).</p>
+
+<p>This blog post highlights some of the techniques we used to achieve this
+performance, and celebrates the teamwork involved.</p>
+
+<h1 id="a-strong-history-of-performance-improvements">A Strong History of
Performance Improvements</h1>
+
+<p>Performance has long been a core focus for DataFusion’s community, and
+speed attracts users and contributors. Recently, we have been
+even more focused on performance, including in July 2024, when <a
href="https://www.linkedin.com/in/mehmet-ozan-kabak/">Mehmet Ozan
+Kabak</a>, CEO of <a href="https://www.synnada.ai/">Synnada</a>, again <a
href="https://github.com/apache/datafusion/issues/11442#issuecomment-2226834443">suggested
focusing on performance</a>. This
+got many of us excited (who doesn’t love a challenge!), and we have
subsequently
+rallied to steadily improve performance release over release, as shown in
+Figure 2.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/perf-over-time.png"
width="100%" class="img-responsive" alt="ClickBench performance results over
time for DataFusion" /></p>
+
+<p><strong>Figure 2</strong>: ClickBench performance improved over 30% between
DataFusion 34
+(released Dec. 2023) and DataFusion 43 (released Nov. 2024).</p>
+
+<p>Like all good optimization efforts, ours took sustained effort as
DataFusion ran
+out of <a
href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion">single
2x performance improvements</a> several years ago. Working together, our
+community of engineers from around the world<sup id="fnref:3"
role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>
and all experience levels<sup id="fnref:4" role="doc-noteref"><a href="#fn:4"
class="footnote" rel="footnote">4</a></sup>
+pulled it off (check out <a
href="https://github.com/apache/datafusion/issues/12821">this discussion</a> to
get a sense). It may be a “<a href="https://db.cs.cmu.edu/seminar2024/">hobo
+sandwich</a>” <sup id="fnref:5" role="doc-noteref"><a href="#fn:5"
class="footnote" rel="footnote">5</a></sup>, but it is a tasty one!</p>
+
+<p>Of course, most of these techniques have been implemented and described
before,
+but until now they were only available in proprietary systems such as
+<a href="https://www.vertica.com/">Vertica</a>, <a
href="https://www.databricks.com/product/photon">Databricks
+Photon</a>, or
+<a href="https://www.snowflake.com/en/">Snowflake</a>, or in tightly integrated
open source
+systems such as <a href="https://duckdb.org/">DuckDB</a> or
+<a href="https://clickhouse.com/">ClickHouse</a>, which were not designed to be
extended.</p>
+
+<h2 id="stringview">StringView</h2>
+
+<p>Performance improved for all queries when DataFusion switched to using Arrow
+<code class="language-plaintext highlighter-rouge">StringView</code>. Using
<code class="language-plaintext highlighter-rouge">StringView</code> “just”
saves some copies and avoids one memory
+access for certain comparisons. However, these copies and comparisons happen to
+occur in many of the hottest loops during query processing, so optimizing them
+resulted in measurable performance improvements.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/string-view-take.png"
width="80%" class="img-responsive" alt="Illustration of how take works with
StringView" /></p>
+
+<p><strong>Figure 3:</strong> Figure from <a
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/">Using
StringView / German Style Strings to Make
+Queries Faster: Part 1</a> showing how <code class="language-plaintext
highlighter-rouge">StringView</code> saves copying data in many cases.</p>
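+
+<p>To make the idea concrete, here is a minimal Python sketch of the 16-byte
+view layout described in the Arrow format specification: a 4-byte length
+followed either by up to 12 inline bytes, or by a 4-byte prefix, a buffer
+index, and an offset. This is not DataFusion or arrow-rs code; the helper names
+<code class="language-plaintext highlighter-rouge">make_view</code>, <code class="language-plaintext highlighter-rouge">views_may_be_equal</code>,
+and <code class="language-plaintext highlighter-rouge">take</code> are hypothetical and exist only for this illustration.</p>
+
```python
import struct

INLINE_LIMIT = 12  # strings of up to 12 bytes live entirely inside the view


def make_view(value: bytes, buffer: bytearray, buffer_index: int = 0) -> bytes:
    """Encode `value` as a 16-byte view, appending long values to `buffer`."""
    if len(value) <= INLINE_LIMIT:
        # 4-byte length, then the data itself zero-padded to 12 bytes
        return struct.pack("<I12s", len(value), value)
    offset = len(buffer)
    buffer.extend(value)
    # 4-byte length, 4-byte prefix, buffer index, offset into that buffer
    return struct.pack("<I4sII", len(value), value[:4], buffer_index, offset)


def views_may_be_equal(a: bytes, b: bytes) -> bool:
    """Compare length and prefix only (the first 8 bytes of each view).

    False means the strings definitely differ, decided without touching
    the data buffer. True means a full comparison is still required.
    """
    return a[:8] == b[:8]


def take(views: list[bytes], indices: list[int]) -> list[bytes]:
    """Gather rows by copying fixed-size views; string bytes stay in place."""
    return [views[i] for i in indices]
```
+
+<p>In this sketch, <code class="language-plaintext highlighter-rouge">take</code> moves only 16 bytes per row regardless of string
+length, and a mismatch in length or prefix is detected without reading the data
+buffer, which is the kind of saving the paragraph above refers to.</p>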
+
+<p>Using StringView to make DataFusion faster for ClickBench required
substantial,
+careful, low-level optimization work described in <a
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-one-influxdb/">Using
StringView / German
+Style Strings to Make Queries Faster: Part 1</a> and <a
href="https://www.influxdata.com/blog/faster-queries-with-stringview-part-two-influxdb/">Part
2</a>. However, it <em>also</em>
+required extending the rest of DataFusion’s operations to support the new type.
+You can get a sense of the magnitude of the work required by looking at the
100+
+pull requests linked to the epic in arrow-rs
+(<a href="https://github.com/apache/arrow-rs/issues/5374">here</a>) and three
major epics
+(<a href="https://github.com/apache/datafusion/issues/10918">here</a>,
+<a href="https://github.com/apache/datafusion/issues/11790">here</a> and
+<a href="https://github.com/apache/datafusion/issues/11752">here</a>) in
DataFusion.</p>
+
+<p>Here is a partial list of people involved in the project (my apologies to
anyone I forgot):</p>
+
+<ul>
+ <li><strong>Arrow</strong>: <a
href="https://github.com/XiangpengHao">Xiangpeng Hao</a> (InfluxData’s amazing
2024 summer intern and UW Madison PhD), <a
href="https://github.com/ariesdevil">Yijun Zhao</a> from DataBend Labs, and <a
href="https://github.com/tustvold">Raphael Taylor-Davies</a> laid the
foundation. <a href="https://github.com/RinChanNOWWW">RinChanNOW</a> from
Tencent and <a href="https://github.com/a10y">Andrew Duffy</a> from SpiralDB
helped push it along in the early d [...]
+ <li><strong>DataFusion</strong>: <a
href="https://github.com/XiangpengHao">Xiangpeng Hao</a> again charted the
initial path and <a href="https://github.com/Weijun-H">Weijun Huang</a>, <a
href="https://github.com/dharanad">Dharan Aditya</a>, <a
href="https://github.com/Lordworms">Lordworms</a>, <a
href="https://github.com/goldmedal">Jax Liu</a>, <a
href="https://github.com/wiedld">wiedld</a>, <a
href="https://github.com/tlm365">Tai Le Manh</a>, <a
href="https://github.com/my-vegetable [...]
+ <li><strong>DataFusion String Function Migration</strong>: <a
href="https://github.com/tshauck">Trent Hauck</a> organized the effort and set
the patterns, <a href="https://github.com/goldmedal">Jax Liu</a> made a clever
testing framework, and <a href="https://github.com/austin362667">Austin
Liu</a>, <a href="https://github.com/demetribu">Dmitrii Bu</a>, <a
href="https://github.com/tlm365">Tai Le Manh</a>, <a
href="https://github.com/PsiACE">Chojan Shang</a>, <a href="https://github.co
[...]
+</ul>
+
+<h2 id="parquet">Parquet</h2>
+
+<p>Part of the reason for DataFusion’s speed in ClickBench is that it reads Parquet
files (really) quickly,
+which reflects sustained investment in the Parquet reading system (see <a
href="https://www.influxdata.com/blog/querying-parquet-millisecond-latency/">Querying
+Parquet with Millisecond Latency</a>).</p>
+
+<p>The <a
href="https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html">DataFusion
ParquetExec</a> (built on the <a href="https://crates.io/crates/parquet">Rust
Parquet Implementation</a>) is now the most
+sophisticated open source Parquet reader I know of. It has every optimization
we
+can think of for reading Parquet, including projection pushdown, predicate
+pushdown (row group metadata, page index, and bloom filters), limit pushdown,
+parallel reading, interleaved I/O, and late materialized filtering (coming
soon ™️
+by default). Work from <a
href="https://github.com/itsjunetime">June</a>
+<a href="https://github.com/apache/datafusion/pull/12135">recently unblocked a
remaining hurdle</a> for enabling late materialized
+filtering, and conveniently <a
href="https://github.com/XiangpengHao">Xiangpeng Hao</a> is
+working on the <a
href="https://github.com/apache/arrow-datafusion/issues/3463">final piece</a>
(no pressure😅).</p>
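+
+<p>As one illustration of predicate pushdown, the sketch below shows the general
+idea behind row group pruning with min/max statistics: a row group is read only
+if its value range can overlap the filter. It is a simplified Python model, not
+the <code class="language-plaintext highlighter-rouge">ParquetExec</code> implementation; <code class="language-plaintext highlighter-rouge">RowGroupStats</code> and
+<code class="language-plaintext highlighter-rouge">row_groups_to_read</code> are hypothetical names, and the single integer column
+is an assumption made to keep the example short.</p>
+
```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical per-row-group metadata for a single integer column."""
    min_value: int
    max_value: int
    num_rows: int


def row_groups_to_read(stats: list[RowGroupStats], lo: int, hi: int) -> list[int]:
    """Return indices of row groups that may hold a value in [lo, hi].

    A row group whose [min, max] range cannot overlap the filter range is
    skipped without being fetched or decoded.
    """
    return [
        i
        for i, s in enumerate(stats)
        if s.max_value >= lo and s.min_value <= hi
    ]
```
+
+<p>Statistics can only prove that a row group contains no matches, so the
+surviving row groups still need the filter applied to their rows.</p>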
+
+<h2 id="skipping-partial-aggregation-when-it-doesnt-help">Skipping Partial
Aggregation When It Doesn’t Help</h2>
+
+<p>Many ClickBench queries are aggregations that summarize millions of rows, a
+common task for reporting and dashboarding. DataFusion uses state-of-the-art
+<a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state">two
phase aggregation</a> plans. Normally, two phase aggregation works well as the
+first phase consolidates many rows immediately after reading, while the data is
+still in cache. However, for certain “high cardinality” aggregate queries (that
+have large numbers of groups), <a
href="https://github.com/apache/datafusion/issues/6937">the two phase
aggregation strategy used in
+DataFusion was inefficient</a>,
+manifesting in relatively slower performance compared to other engines for
+ClickBench queries such as:</p>
+
+<div class="language-sql highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="k">SELECT</span> <span
class="nv">"WatchID"</span><span class="p">,</span> <span
class="nv">"ClientIP"</span><span class="p">,</span> <span
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span
class="p">)</span> <span class="k">AS</span> <span class="k">c</span><span
class="p">,</span> <span class="p">...</span>
+<span class="k">FROM</span> <span class="n">hits</span>
+<span class="k">GROUP</span> <span class="k">BY</span> <span
class="nv">"WatchID"</span><span class="p">,</span> <span
class="nv">"ClientIP"</span> <span class="cm">/* <----- 13M Distinct
Groups!!! */</span>
+<span class="k">ORDER</span> <span class="k">BY</span> <span
class="k">c</span> <span class="k">DESC</span>
+<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
+</code></pre></div></div>
+
+<p>For such queries, the first aggregation phase does not significantly
+reduce the number of rows, so most of its work is wasted. <a
href="https://github.com/korowa">Eduard
+Karacharov</a> contributed a <a
href="https://github.com/apache/datafusion/pull/11627">dynamic strategy</a> to
+bypass the first phase when it is not working efficiently, shown in Figure
4.</p>
+
+<p><img
src="/blog/img/clickbench-datafusion-43/skipping-partial-aggregation.png"
width="100%" class="img-responsive" alt="Two phase aggregation diagram from
DataFusion API docs annotated to show first phase not helping" /></p>
+
+<p><strong>Figure 4</strong>: Diagram from <a
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html#tymethod.state">DataFusion
API docs</a> showing when the multi-phase
+grouping is not effective</p>
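+
+<p>The following Python sketch shows one way such a dynamic strategy can work
+for a <code class="language-plaintext highlighter-rouge">COUNT(*)</code> query: the first phase aggregates normally while it
+measures the ratio of distinct groups to input rows, and once that ratio shows
+that grouping is not reducing the data, it flushes its state and passes rows
+through unaggregated. This is an illustration of the general technique and not
+the code from the pull request; the function names and the <code class="language-plaintext highlighter-rouge">probe_rows</code>
+and <code class="language-plaintext highlighter-rouge">max_ratio</code> parameters (with their tiny default values) are assumptions
+chosen for the example.</p>
+
```python
from collections import Counter
from typing import Hashable, Iterable, Iterator


def partial_count(
    batches: Iterable[list[Hashable]],
    probe_rows: int = 4,      # hypothetical: rows to observe before deciding
    max_ratio: float = 0.8,   # hypothetical: groups/rows ratio that triggers skipping
) -> Iterator[tuple[Hashable, int]]:
    """First phase: yield (key, partial_count) pairs for the final phase."""
    groups: Counter = Counter()
    rows_seen = 0
    skipping = False
    for batch in batches:
        if skipping:
            # Pass-through mode: each row becomes its own partial state.
            yield from ((key, 1) for key in batch)
            continue
        groups.update(batch)
        rows_seen += len(batch)
        if rows_seen >= probe_rows and len(groups) / rows_seen > max_ratio:
            # Grouping is barely reducing the row count: flush and stop.
            yield from groups.items()
            groups.clear()
            skipping = True
    yield from groups.items()


def final_count(partials: Iterable[tuple[Hashable, int]]) -> dict:
    """Second phase: merge partial counts into the final result."""
    totals: Counter = Counter()
    for key, count in partials:
        totals[key] += count
    return dict(totals)
```
+
+<p>Because the final phase merges partial states either way, skipping the first
+phase changes only where the work happens, not the query result.</p>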
+
+<h2 id="optimized-multi-column-grouping">Optimized Multi-Column Grouping</h2>
+
+<p>Another method for improving analytic database performance is specialized
(aka
+highly optimized) versions of operations for different data types, which the
+system picks at runtime based on the query. Like other systems, DataFusion has
+specialized code for handling different types of group columns. For example,
+there is <a
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/primitive.rs">special
code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP
BY int_id</code> and <a
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/single_group_by/bytes.rs">different
special
+code</a> that handles <code class="language-plaintext highlighter-rouge">GROUP
BY string_id</code>.</p>
+
+<p>When a query groups by multiple columns, it is trickier to apply this
technique.
+For example, <code class="language-plaintext highlighter-rouge">GROUP BY
string_id, int_id</code> and <code class="language-plaintext
highlighter-rouge">GROUP BY int_id, string_id</code> have
+different optimal structures, but it is not possible to include specialized
+versions for all possible combinations of group column types.</p>
+
+<p>DataFusion includes <a
href="https://github.com/apache/datafusion/blob/73507c307487708deb321e1ba4e0d302084ca27e/datafusion/physical-plan/src/aggregates/group_values/row.rs#L33-L39">a
general Row based mechanism</a> that works for any
+combination of column types, but this general mechanism copies each value twice
+as shown in Figure 5. The cost of this copy <a
href="https://github.com/apache/datafusion/issues/9403">is especially high for
variable
+length strings and binary data</a>.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/row-based-storage.png"
width="100%" class="img-responsive" alt="Row based storage for multiple group
columns" /></p>
+
+<p><strong>Figure 5</strong>: Prior to DataFusion 43.0.0, queries with
multiple group columns
+used Row based group storage and copied each group value twice. This copy
+consumes a substantial amount of the query time for queries with many distinct
+groups, such as several of the queries in ClickBench.</p>
+
+<p>Many optimizations in databases boil down to simply avoiding copies, and
this
+was no exception. The trick was to figure out how to avoid copies without
+causing per-column comparison overhead to dominate or complexity to get out of
+hand. In a great example of diligent and disciplined engineering, <a
href="https://github.com/jayzhan211">Jay
+Zhan</a> tried <a
href="https://github.com/apache/datafusion/pull/10937">several</a>, <a
href="https://github.com/apache/datafusion/pull/10976">different</a> approaches
until arriving
+at the <a href="https://github.com/apache/datafusion/pull/12269">one shipped
in DataFusion <code class="language-plaintext
highlighter-rouge">43.0.0</code></a>, shown in Figure 6.</p>
+
+<p><img src="/blog/img/clickbench-datafusion-43/column-based-storage.png"
width="100%" class="img-responsive" alt="Column based storage for multiple
group columns" /></p>
+
+<p><strong>Figure 6</strong>: DataFusion 43.0.0’s new columnar group storage
copies each group
+value exactly once, which is significantly faster when grouping by multiple
+columns.</p>
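+
+<p>A rough Python sketch of the columnar idea follows: distinct group keys are
+stored one column at a time, lookups compare candidate groups column by column,
+and a new key is appended to each column builder exactly once. It illustrates
+the layout only and is not the structure shipped in DataFusion; the class
+<code class="language-plaintext highlighter-rouge">ColumnarGroupValues</code> and its <code class="language-plaintext highlighter-rouge">intern</code> method are hypothetical, and the
+sketch assumes at least one group column. Python lists hold references, so the
+sketch does not reproduce the memory behavior of a real engine.</p>
+
```python
class ColumnarGroupValues:
    """Store distinct group keys column by column (assumes >= 1 column)."""

    def __init__(self, num_columns: int):
        # One append-only builder per group column.
        self.columns = [[] for _ in range(num_columns)]
        # Hash of the key -> indices of groups that share that hash.
        self.buckets: dict[int, list[int]] = {}

    def _matches(self, group_index: int, row: tuple) -> bool:
        # Compare the incoming row with a stored group, one column at a time.
        return all(col[group_index] == v for col, v in zip(self.columns, row))

    def intern(self, row: tuple) -> int:
        """Return the group index for `row`, adding a new group if needed."""
        candidates = self.buckets.setdefault(hash(row), [])
        for group_index in candidates:
            if self._matches(group_index, row):
                return group_index
        group_index = len(self.columns[0])
        for col, value in zip(self.columns, row):
            col.append(value)  # each value is stored once, in its own column
        candidates.append(group_index)
        return group_index
```
+
+<p>For example, interning the rows <code class="language-plaintext highlighter-rouge">("a", 1)</code>, <code class="language-plaintext highlighter-rouge">("b", 1)</code>,
+<code class="language-plaintext highlighter-rouge">("a", 1)</code> assigns group indices 0, 1, 0 and leaves two entries in each
+column builder.</p>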
+
+<p>Huge thanks as well to <a href="https://github.com/eejbyfeldt">Emil
Ejbyfeldt</a> and
+<a href="https://github.com/Dandandan">Daniël Heres</a> for their help
reviewing and to
+<a href="https://github.com/Rachelint">Rachelint (kamille)</a> for reviewing
and
+contributing a faster <a
href="https://github.com/apache/datafusion/pull/12996">vectorized append and
compare for multiple groups</a> which
+will be released in DataFusion 44. The discussion on <a
href="https://github.com/apache/datafusion/issues/9403">the ticket</a> is
another
+great example of the power of the DataFusion community working together to
build
+great software.</p>
+
+<h1 id="whats-next-">What’s Next 🚀</h1>
+
+<p>Just as I expect the performance of other engines to improve, DataFusion has
+several more performance improvements lined up itself:</p>
+
+<ol>
+ <li><a
href="https://github.com/apache/datafusion/pull/11943#top">Intermediate results
blocked management</a> (thanks again <a
href="https://github.com/Rachelint">Rachelint (kamille)</a>)</li>
+ <li><a href="https://github.com/apache/datafusion/issues/3463">Enable
parquet filter pushdown by default</a></li>
+</ol>
+
+<p>We are also talking about what to focus on over the <a
href="https://github.com/apache/datafusion/issues/13274">next three
+months</a> and are always
+looking for people to help! If you want to geek out (obsess??) about
performance
+and other features with engineers from around the world, <a
href="https://datafusion.apache.org/contributor-guide/communication.html">we
would love you to
+join us</a>.</p>
+
+<h1 id="additional-thanks">Additional Thanks</h1>
+
+<p>In addition to the people called out above, thanks:</p>
+
+<ol>
+ <li><a href="https://github.com/pmcgleenon">Patrick McGleenon</a> for
running ClickBench and gathering this data (<a
href="https://github.com/apache/datafusion/issues/13099#issuecomment-2478314793">source</a>).</li>
+ <li>Everyone I missed in the shoutouts – there are so many of you. We
appreciate everyone.</li>
+</ol>
+
+<h1 id="conclusion">Conclusion</h1>
+
+<p>I have dreamed about DataFusion being on top of the ClickBench leaderboard
for
+several years. I often watched with envy improvements in systems backed by
large
+VC investments, internet companies, or world class research institutions, and
+doubted that we could pull off something similar in an open source project with
+always limited time.</p>
+
+<p>I think the fact that we have now surpassed those other systems in query
performance
+speaks to the power and possibility of focusing on community and aligning
+our collective enthusiasm and skills towards a common goal. Of course, being on
+the top in any particular benchmark is likely fleeting as other engines will
+improve, but so will DataFusion!</p>
+
+<p>I love working on DataFusion – the people, the quality of the code, my
+interactions and the results we have achieved together far surpass my
+expectations as well as most of my other software development experiences. I
+can’t wait to see what people will build next, and hope to <a
href="https://github.com/apache/datafusion">see you
+online</a>.</p>
+
+<h2 id="notes">Notes</h2>
+
+<div class="footnotes" role="doc-endnotes">
+ <ol>
+ <li id="fn:1" role="doc-endnote">
+ <p>Note that DuckDB is slightly faster on the ‘cold’ run. <a
href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:2" role="doc-endnote">
+ <p>Want to try your hand at a custom format for ClickBench fame /
glory?: <a href="https://github.com/apache/datafusion/issues/13448">Make
DataFusion the fastest engine in ClickBench with custom file format</a> <a
href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:3" role="doc-endnote">
+ <p>We have contributors from North America, South America, Europe,
Asia, Africa and Australia <a href="#fnref:3" class="reversefootnote"
role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:4" role="doc-endnote">
+ <p>Undergraduates, PhD students, junior engineers, and getting-kind-of-crotchety
experienced engineers <a href="#fnref:4" class="reversefootnote"
role="doc-backlink">↩</a></p>
+ </li>
+ <li id="fn:5" role="doc-endnote">
+ <p>Thanks to Andy Pavlo, I love that nomenclature <a href="#fnref:5"
class="reversefootnote" role="doc-backlink">↩</a></p>
+ </li>
+ </ol>
+</div>]]></content><author><name>Andrew Lamb, Staff Engineer at
InfluxData</name></author><category term="core" /><category term="performance"
/><summary type="html"><![CDATA[<!–]]></summary></entry><entry><title
type="html">Apache DataFusion Comet 0.3.0 Release</title><link
href="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/"
rel="alternate" type="text/html" title="Apache DataFusion Comet 0.3.0 Release"
/><published>2024-09-27T00:00:00+00:00</published><update [...]
-->
@@ -1780,55 +2033,4 @@ recordings of previous calls.</p>
performance regressions that you find. See the <a
href="https://datafusion.apache.org/comet/user-guide/installation.html">Getting
Started</a> guide for instructions on downloading and installing
Comet.</p>
-<p>There are also many <a
href="https://github.com/apache/datafusion-comet/contribute">good first
issues</a> waiting for
contributions.</p>]]></content><author><name>pmc</name></author><category
term="subprojects" /><summary
type="html"><![CDATA[<!–]]></summary></entry><entry><title
type="html">Announcing Apache Arrow DataFusion is now Apache
DataFusion</title><link
href="https://datafusion.apache.org/blog/2024/05/07/datafusion-tlp/"
rel="alternate" type="text/html" title="Announcing [...]
-
--->
-
-<h2 id="introduction">Introduction</h2>
-
-<p>TLDR; <a href="https://arrow.apache.org/">Apache Arrow</a> DataFusion –>
<a href="https://datafusion.apache.org/">Apache DataFusion</a></p>
-
-<p>The Arrow PMC and newly created DataFusion PMC are happy to announce that
as of
-April 16, 2024 the Apache Arrow DataFusion subproject is now a top level
-<a href="https://www.apache.org/">Apache Software Foundation</a> project.</p>
-
-<h2 id="background">Background</h2>
-
-<p>Apache DataFusion is a fast, extensible query engine for building
high-quality
-data-centric systems in Rust, using the Apache Arrow in-memory format.</p>
-
-<p>When DataFusion was <a
href="https://arrow.apache.org/blog/2019/02/04/datafusion-donation/">donated to
the Apache Software Foundation</a> in 2019, the
-DataFusion community was not large enough to stand on its own and the Arrow
-project agreed to help support it. The community has grown significantly since
-2019, benefiting immensely from being part of Arrow and following <a
href="https://www.apache.org/theapacheway/">The Apache
-Way</a>.</p>
-
-<h2 id="why-now">Why now?</h2>
-
-<p>The community <a
href="https://github.com/apache/datafusion/discussions/6475">discussed
graduating to a top level project publicly</a> for almost
-a year, as the project seemed ready to stand on its own and would benefit from
-more focused governance. For example, earlier in DataFusion’s life many
-contributed to both <a href="https://github.com/apache/arrow-rs">arrow-rs</a>
and DataFusion, but as DataFusion has matured many
-contributors, committers and PMC members focused more and more exclusively on
-DataFusion.</p>
-
-<h2 id="looking-forward">Looking forward</h2>
-
-<p>The future looks bright. There are now <a
href="https://datafusion.apache.org/user-guide/introduction.html#known-users">10s
of known projects built with
-DataFusion</a>, and that number continues to grow. We recently held our <a
href="https://github.com/apache/datafusion/discussions/8522">first in
-person meetup</a> passed <a
href="https://github.com/apache/datafusion/stargazers">5000 stars</a> on
GitHub, <a
href="https://github.com/apache/datafusion/issues/8373#issuecomment-2025133714">wrote
a paper that was accepted
-at SIGMOD 2024</a>, and began work on <a
href="https://github.com/apache/datafusion-comet">Comet</a>, an <a
href="https://spark.apache.org/">Apache Spark</a> accelerator
-<a href="https://arrow.apache.org/blog/2024/03/06/comet-donation/">initially
donated by Apple</a>.</p>
-
-<p>Thank you to everyone in the Arrow community who helped DataFusion grow and
-mature over the years, and we look forward to continuing our collaboration as
-projects. All future blogs and announcements will be posted on the <a
href="https://datafusion.apache.org/">Apache
-DataFusion</a> website.</p>
-
-<h2 id="get-involved">Get Involved</h2>
-
-<p>If you are interested in joining the community, we would love to have you
join
-us. Get in touch using <a
href="https://datafusion.apache.org/contributor-guide/communication.html">Communication
Doc</a> and learn how to get involved in the
-<a
href="https://datafusion.apache.org/contributor-guide/index.html">Contributor
Guide</a>. We welcome everyone to try DataFusion on their
-own data and projects and let us know how it goes, contribute suggestions,
-documentation, bug reports, or a PR with documentation, tests or
code.</p>]]></content><author><name>pmc</name></author><category
term="subprojects" /><summary
type="html"><![CDATA[<!–]]></summary></entry></feed>
\ No newline at end of file
+<p>There are also many <a
href="https://github.com/apache/datafusion-comet/contribute">good first
issues</a> waiting for
contributions.</p>]]></content><author><name>pmc</name></author><category
term="subprojects" /><summary
type="html"><![CDATA[<!–]]></summary></entry></feed>
\ No newline at end of file
diff --git a/img/clickbench-datafusion-43/column-based-storage.png
b/img/clickbench-datafusion-43/column-based-storage.png
new file mode 100644
index 0000000..8130fcc
Binary files /dev/null and
b/img/clickbench-datafusion-43/column-based-storage.png differ
diff --git a/img/clickbench-datafusion-43/perf-over-time.png
b/img/clickbench-datafusion-43/perf-over-time.png
new file mode 100644
index 0000000..f4bf673
Binary files /dev/null and b/img/clickbench-datafusion-43/perf-over-time.png
differ
diff --git a/img/clickbench-datafusion-43/perf.png
b/img/clickbench-datafusion-43/perf.png
new file mode 100644
index 0000000..6e12e66
Binary files /dev/null and b/img/clickbench-datafusion-43/perf.png differ
diff --git a/img/clickbench-datafusion-43/row-based-storage.png
b/img/clickbench-datafusion-43/row-based-storage.png
new file mode 100644
index 0000000..07fef67
Binary files /dev/null and b/img/clickbench-datafusion-43/row-based-storage.png
differ
diff --git a/img/clickbench-datafusion-43/skipping-partial-aggregation.png
b/img/clickbench-datafusion-43/skipping-partial-aggregation.png
new file mode 100644
index 0000000..411464d
Binary files /dev/null and
b/img/clickbench-datafusion-43/skipping-partial-aggregation.png differ
diff --git a/img/clickbench-datafusion-43/string-view-take.png
b/img/clickbench-datafusion-43/string-view-take.png
new file mode 100644
index 0000000..0cedd25
Binary files /dev/null and b/img/clickbench-datafusion-43/string-view-take.png
differ
diff --git a/index.html b/index.html
index c136003..42665fa 100644
--- a/index.html
+++ b/index.html
@@ -48,6 +48,11 @@
<a class="post-link"
href="/blog/2024/11/19/datafusion-python-udf-comparisons/">
Comparing approaches to User Defined Functions in Apache
DataFusion using Python
</a>
+ </h3></li><li><span class="post-meta">Nov 18, 2024</span>
+ <h3>
+ <a class="post-link"
href="/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/">
+ Apache DataFusion is now the fastest single node engine for
querying Apache Parquet files
+ </a>
</h3></li><li><span class="post-meta">Sep 27, 2024</span>
<h3>
<a class="post-link" href="/blog/2024/09/27/datafusion-comet-0.3.0/">