This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 0466888 Comet 0.3.0 blog post (#30)
0466888 is described below
commit 0466888c1ecae6edaacabd5f6424390f9beab179
Author: Andy Grove <[email protected]>
AuthorDate: Mon Oct 7 20:37:06 2024 -0600
Comet 0.3.0 blog post (#30)
---
2024/09/27/datafusion-comet-0.3.0/index.html | 152 +++++++++
feed.xml | 449 +++++----------------------
index.html | 7 +-
3 files changed, 228 insertions(+), 380 deletions(-)
diff --git a/2024/09/27/datafusion-comet-0.3.0/index.html
b/2024/09/27/datafusion-comet-0.3.0/index.html
new file mode 100644
index 0000000..953011f
--- /dev/null
+++ b/2024/09/27/datafusion-comet-0.3.0/index.html
@@ -0,0 +1,152 @@
+<!DOCTYPE html>
+<html lang="en"><head>
+ <meta charset="utf-8">
+ <meta http-equiv="X-UA-Compatible" content="IE=edge">
+ <meta name="viewport" content="width=device-width, initial-scale=1"><!--
Begin Jekyll SEO tag v2.8.0 -->
+<title>Apache DataFusion Comet 0.3.0 Release | Apache DataFusion Project News
& Blog</title>
+<meta name="generator" content="Jekyll v4.3.3" />
+<meta property="og:title" content="Apache DataFusion Comet 0.3.0 Release" />
+<meta name="author" content="pmc" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="<!–" />
+<meta property="og:description" content="<!–" />
+<link rel="canonical"
href="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/" />
+<meta property="og:url"
content="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/"
/>
+<meta property="og:site_name" content="Apache DataFusion Project News &
Blog" />
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2024-09-27T00:00:00+00:00" />
+<meta name="twitter:card" content="summary" />
+<meta property="twitter:title" content="Apache DataFusion Comet 0.3.0 Release"
/>
+<script type="application/ld+json">
+{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"pmc"},"dateModified":"2024-09-27T00:00:00+00:00","datePublished":"2024-09-27T00:00:00+00:00","description":"<!–","headline":"Apache
DataFusion Comet 0.3.0
Release","mainEntityOfPage":{"@type":"WebPage","@id":"https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/"},"publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://datafusion.apache.org/blog/img/2x_bgw
[...]
+<!-- End Jekyll SEO tag -->
+<link rel="stylesheet" href="/blog/assets/main.css"><link
type="application/atom+xml" rel="alternate"
href="https://datafusion.apache.org/blog/feed.xml" title="Apache DataFusion
Project News & Blog" /></head>
+<body><header class="site-header" role="banner">
+
+ <div class="wrapper"><a class="site-title" rel="author" href="/blog/">Apache
DataFusion Project News & Blog</a><nav class="site-nav">
+ <input type="checkbox" id="nav-trigger" class="nav-trigger" />
+ <label for="nav-trigger">
+ <span class="menu-icon">
+ <svg viewBox="0 0 18 15" width="18px" height="15px">
+ <path
d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0
h15.032C17.335,0,18,0.665,18,1.484L18,1.484z
M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0
c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z
M18,13.516C18,14.335,17.335,15,16.516,15H1.484
C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/>
+ </svg>
+ </span>
+ </label>
+
+ <div class="trigger"><a class="page-link"
href="/blog/about/">About</a></div>
+ </nav></div>
+</header>
+<main class="page-content" aria-label="Content">
+ <div class="wrapper">
+ <article class="post h-entry" itemscope
itemtype="http://schema.org/BlogPosting">
+
+ <header class="post-header">
+ <h1 class="post-title p-name" itemprop="name headline">Apache DataFusion
Comet 0.3.0 Release</h1>
+ <p class="post-meta">
+ <time class="dt-published" datetime="2024-09-27T00:00:00+00:00"
itemprop="datePublished">Sep 27, 2024
+ </time>• <span itemprop="author" itemscope
itemtype="http://schema.org/Person"><span class="p-author h-card"
itemprop="name">pmc</span></span></p>
+ </header>
+
+ <div class="post-content e-content" itemprop="articleBody">
+ <!--
+
+-->
+
+<p>The Apache DataFusion PMC is pleased to announce version 0.3.0 of the <a
href="https://datafusion.apache.org/comet/">Comet</a> subproject.</p>
+
+<p>Comet is an accelerator for Apache Spark that translates Spark physical
plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.</p>
+
+<p>Comet runs on commodity hardware and aims to provide 100% compatibility
with Apache Spark. Any operators or
+expressions that are not fully compatible will fall back to Spark unless
explicitly enabled by the user. Refer
+to the <a
href="https://datafusion.apache.org/comet/user-guide/compatibility.html">compatibility
guide</a> for more information.</p>
+
+<p>This release covers approximately four weeks of development work and is the
result of merging 57 PRs from 12
+contributors. See the <a
href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.3.0.md">change
log</a> for more information.</p>
+
+<h2 id="release-highlights">Release Highlights</h2>
+
+<h3 id="binary-releases">Binary Releases</h3>
+
+<p>Comet jar files are now published to Maven central for amd64 and arm64
architectures (Linux only).</p>
+
+<p>Files can be found at
https://central.sonatype.com/search?q=org.apache.datafusion</p>
+
+<ul>
+ <li>Spark versions 3.3, 3.4, and 3.5 are supported.</li>
+ <li>Scala versions 2.12 and 2.13 are supported.</li>
+</ul>
+
+<h3 id="new-features">New Features</h3>
+
+<p>The following expressions are now supported natively:</p>
+
+<ul>
+ <li><code class="language-plaintext highlighter-rouge">DateAdd</code></li>
+ <li><code class="language-plaintext highlighter-rouge">DateSub</code></li>
+ <li><code class="language-plaintext highlighter-rouge">ElementAt</code></li>
+ <li><code class="language-plaintext
highlighter-rouge">GetArrayElement</code></li>
+ <li><code class="language-plaintext highlighter-rouge">ToJson</code></li>
+</ul>
+
+<h3 id="performance--stability">Performance & Stability</h3>
+
+<ul>
+ <li>Upgraded to DataFusion 42.0.0</li>
+ <li>Reduced memory overhead by fixing some memory leaks</li>
+ <li>Comet now falls back to Spark for queries that use dynamic partition pruning (DPP), which avoids
performance regressions because Comet does
+not yet support DPP natively</li>
+ <li>Improved performance when converting Spark columnar data to Arrow
format</li>
+ <li>Faster decimal sum and avg functions</li>
+</ul>
+
+<h3 id="documentation-updates">Documentation Updates</h3>
+
+<ul>
+ <li>Improved documentation for deploying Comet with Kubernetes and Helm in
the <a
href="https://datafusion.apache.org/comet/user-guide/kubernetes.html">Comet
Kubernetes Guide</a></li>
+ <li>More detailed architectural overview of Comet scan and execution in the
<a
href="https://datafusion.apache.org/comet/contributor-guide/plugin_overview.html">Comet
Plugin Overview</a> in the contributor guide</li>
+</ul>
+
+<h2 id="getting-involved">Getting Involved</h2>
+
+<p>The Comet project welcomes new contributors. We use the same <a
href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord">Slack
and Discord</a> channels as the main DataFusion
+project.</p>
+
+<p>The easiest way to get involved is to test Comet with your current Spark
jobs and file issues for any bugs or
+performance regressions that you find. See the <a
href="https://datafusion.apache.org/comet/user-guide/installation.html">Getting
Started</a> guide for instructions on downloading and installing
+Comet.</p>
+
+<p>There are also many <a
href="https://github.com/apache/datafusion-comet/contribute">good first
issues</a> waiting for contributions.</p>
+
+
+ </div><a class="u-url" href="/blog/2024/09/27/datafusion-comet-0.3.0/"
hidden></a>
+</article>
+
+ </div>
+ </main><footer class="site-footer h-card">
+ <data class="u-url" href="/blog/"></data>
+
+ <div class="wrapper">
+
+ <h2 class="footer-heading">Apache DataFusion Project News & Blog</h2>
+
+ <div class="footer-col-wrapper">
+ <div class="footer-col footer-col-1">
+ <ul class="contact-list">
+ <li class="p-name">Apache DataFusion Project News &
Blog</li><li><a class="u-email"
href="mailto:[email protected]">[email protected]</a></li></ul>
+ </div>
+
+ <div class="footer-col footer-col-2"><ul
class="social-media-list"><li><a href="https://github.com/apache"><svg
class="svg-icon"><use
xlink:href="/blog/assets/minima-social-icons.svg#github"></use></svg> <span
class="username">apache</span></a></li><li><a
href="https://www.twitter.com/ApacheDataFusio"><svg class="svg-icon"><use
xlink:href="/blog/assets/minima-social-icons.svg#twitter"></use></svg> <span
class="username">ApacheDataFusio</span></a></li></ul>
+</div>
+
+ <div class="footer-col footer-col-3">
+ <p>Apache DataFusion is a very fast, extensible query engine for
building high-quality data-centric systems in Rust, using the Apache Arrow
in-memory format.</p>
+ </div>
+ </div>
+
+ </div>
+
+</footer>
+</body>
+
+</html>
diff --git a/feed.xml b/feed.xml
index c542e2a..a48659e 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,72 @@
-<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.3.3">Jekyll</generator><link
href="https://datafusion.apache.org/blog/feed.xml" rel="self"
type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/"
rel="alternate" type="text/html"
/><updated>2024-10-01T19:55:17+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title
type="html">Apache DataFusion Project News &amp; [...]
+<?xml version="1.0" encoding="utf-8"?><feed
xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/"
version="4.3.3">Jekyll</generator><link
href="https://datafusion.apache.org/blog/feed.xml" rel="self"
type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/"
rel="alternate" type="text/html"
/><updated>2024-10-08T02:34:20+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title
type="html">Apache DataFusion Project News &amp; [...]
+
+-->
+
+<p>The Apache DataFusion PMC is pleased to announce version 0.3.0 of the <a
href="https://datafusion.apache.org/comet/">Comet</a> subproject.</p>
+
+<p>Comet is an accelerator for Apache Spark that translates Spark physical
plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.</p>
+
+<p>Comet runs on commodity hardware and aims to provide 100% compatibility
with Apache Spark. Any operators or
+expressions that are not fully compatible will fall back to Spark unless
explicitly enabled by the user. Refer
+to the <a
href="https://datafusion.apache.org/comet/user-guide/compatibility.html">compatibility
guide</a> for more information.</p>
+
+<p>This release covers approximately four weeks of development work and is the
result of merging 57 PRs from 12
+contributors. See the <a
href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.3.0.md">change
log</a> for more information.</p>
+
+<h2 id="release-highlights">Release Highlights</h2>
+
+<h3 id="binary-releases">Binary Releases</h3>
+
+<p>Comet jar files are now published to Maven central for amd64 and arm64
architectures (Linux only).</p>
+
+<p>Files can be found at
https://central.sonatype.com/search?q=org.apache.datafusion</p>
+
+<ul>
+ <li>Spark versions 3.3, 3.4, and 3.5 are supported.</li>
+ <li>Scala versions 2.12 and 2.13 are supported.</li>
+</ul>
+
+<h3 id="new-features">New Features</h3>
+
+<p>The following expressions are now supported natively:</p>
+
+<ul>
+ <li><code class="language-plaintext highlighter-rouge">DateAdd</code></li>
+ <li><code class="language-plaintext highlighter-rouge">DateSub</code></li>
+ <li><code class="language-plaintext highlighter-rouge">ElementAt</code></li>
+ <li><code class="language-plaintext
highlighter-rouge">GetArrayElement</code></li>
+ <li><code class="language-plaintext highlighter-rouge">ToJson</code></li>
+</ul>
+
+<h3 id="performance--stability">Performance & Stability</h3>
+
+<ul>
+ <li>Upgraded to DataFusion 42.0.0</li>
+ <li>Reduced memory overhead by fixing some memory leaks</li>
+ <li>Comet now falls back to Spark for queries that use dynamic partition pruning (DPP), which avoids
performance regressions because Comet does
+not yet support DPP natively</li>
+ <li>Improved performance when converting Spark columnar data to Arrow
format</li>
+ <li>Faster decimal sum and avg functions</li>
+</ul>
+
+<h3 id="documentation-updates">Documentation Updates</h3>
+
+<ul>
+ <li>Improved documentation for deploying Comet with Kubernetes and Helm in
the <a
href="https://datafusion.apache.org/comet/user-guide/kubernetes.html">Comet
Kubernetes Guide</a></li>
+ <li>More detailed architectural overview of Comet scan and execution in the
<a
href="https://datafusion.apache.org/comet/contributor-guide/plugin_overview.html">Comet
Plugin Overview</a> in the contributor guide</li>
+</ul>
+
+<h2 id="getting-involved">Getting Involved</h2>
+
+<p>The Comet project welcomes new contributors. We use the same <a
href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord">Slack
and Discord</a> channels as the main DataFusion
+project.</p>
+
+<p>The easiest way to get involved is to test Comet with your current Spark
jobs and file issues for any bugs or
+performance regressions that you find. See the <a
href="https://datafusion.apache.org/comet/user-guide/installation.html">Getting
Started</a> guide for instructions on downloading and installing
+Comet.</p>
+
+<p>There are also many <a
href="https://github.com/apache/datafusion-comet/contribute">good first
issues</a> waiting for
contributions.</p>]]></content><author><name>pmc</name></author><category
term="subprojects" /><summary
type="html"><![CDATA[<!–]]></summary></entry><entry><title type="html">Using
StringView / German Style Strings to Make Queries Faster: Part 1- Reading
Parquet</title><link
href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/
[...]
-->
@@ -1395,381 +1463,4 @@ suitable for beginners is <a
href="https://github.com/apache/arrow-datafusion/is
meetings. Timezones are always a challenge for such meetings, but we hope to
have two calls that can work for most attendees. If you are interested
in helping, or just want to say hi, please drop us a note via one of
-the methods listed in our <a
href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">Communication
Doc</a>.</p>]]></content><author><name>pmc</name></author><category
term="release" /><summary
type="html"><![CDATA[<!–]]></summary></entry><entry><title
type="html">Aggregating Millions of Groups Fast in Apache Arrow DataFusion
28.0.0</title><link
href="https://datafusion.apache.org/blog/2023/08/05/datafusion_fast_grouping/"
rel="alternate" type="text/html" title= [...]
-
--->
-
-<!-- Converted from Google Docs using
https://www.buymeacoffee.com/docstomarkdown -->
-
-<h2
id="aggregating-millions-of-groups-fast-in-apache-arrow-datafusion">Aggregating
Millions of Groups Fast in Apache Arrow DataFusion</h2>
-
-<p>Andrew Lamb, Daniël Heres, Raphael Taylor-Davies</p>
-
-<p><em>Note: this article was originally published on the <a
href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion">InfluxData
Blog</a></em></p>
-
-<h2 id="tldr">TLDR</h2>
-
-<p>Grouped aggregations are a core part of any analytic tool, creating
understandable summaries of huge data volumes. <a
href="https://arrow.apache.org/datafusion/">Apache Arrow DataFusion</a>’s
parallel aggregation capability is 2-3x faster in the <a
href="https://crates.io/crates/datafusion/28.0.0">newly released version <code
class="language-plaintext highlighter-rouge">28.0.0</code></a> for queries with
a large number (10,000 or more) of groups.</p>
-
-<p>Improving aggregation performance matters to all users of DataFusion. For
example, both InfluxDB, a <a href="https://github.com/influxdata/influxdb">time
series data platform</a> and Coralogix, a <a
href="https://coralogix.com/?utm_source=InfluxDB&utm_medium=Blog&utm_campaign=organic">full-stack
observability</a> platform, aggregate vast amounts of raw data to monitor and
create insights for our customers. Improving DataFusion’s performance lets us
provide better user experien [...]
-
-<p>With the new optimizations, DataFusion’s grouping speed is now close to
DuckDB, a system that regularly reports <a
href="https://duckdblabs.github.io/db-benchmark/">great</a> <a
href="https://duckdb.org/2022/03/07/aggregate-hashtable.html#experiments">grouping</a>
benchmark performance numbers. Figure 1 contains a representative sample of <a
href="https://github.com/ClickHouse/ClickBench/tree/main">ClickBench</a> on a
single Parquet file, and the full results are at the end of this ar [...]
-
-<p><img src="/blog/assets/datafusion_fast_grouping/summary.png" width="700"
/></p>
-
-<p><strong>Figure 1</strong>: Query performance for ClickBench
queries 16, 17, 18 and 19 on a single Parquet file for DataFusion <code
class="language-plaintext highlighter-rouge">27.0.0</code>, DataFusion <code
class="language-plaintext highlighter-rouge">28.0.0</code> and DuckDB <code
class="language-plaintext highlighter-rouge">0.8.1</code>.</p>
-
-<h2 id="introduction-to-high-cardinality-grouping">Introduction to high
cardinality grouping</h2>
-
-<p>Aggregation is a fancy word for computing summary statistics across many
rows that have the same value in one or more columns. We call the rows with the
same values <em>groups</em> and “high cardinality” means there are a large
number of distinct groups in the dataset. At the time of writing, a “large”
number of groups in analytic engines is around 10,000.</p>
-
-<p>For example, the <a
href="https://github.com/ClickHouse/ClickBench">ClickBench</a> <em>hits</em>
dataset contains 100 million anonymized user clicks across a set of websites.
ClickBench Query 17 is:</p>
-
-<div class="language-sql highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="k">SELECT</span> <span
class="nv">"UserID"</span><span class="p">,</span> <span
class="nv">"SearchPhrase"</span><span class="p">,</span> <span
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span
class="p">)</span>
-<span class="k">FROM</span> <span class="n">hits</span>
-<span class="k">GROUP</span> <span class="k">BY</span> <span
class="nv">"UserID"</span><span class="p">,</span> <span
class="nv">"SearchPhrase"</span>
-<span class="k">ORDER</span> <span class="k">BY</span> <span
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span
class="p">)</span>
-<span class="k">DESC</span> <span class="k">LIMIT</span> <span
class="mi">10</span><span class="p">;</span>
-</code></pre></div></div>
-
-<p>In English, this query finds “the top ten (user, search phrase)
combinations, across all clicks” and produces the following results (there are
no search phrases for the top ten users):</p>
-
-<div class="language-text highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>+---------------------+--------------+-----------------+
-| UserID | SearchPhrase | COUNT(UInt8(1)) |
-+---------------------+--------------+-----------------+
-| 1313338681122956954 | | 29097 |
-| 1907779576417363396 | | 25333 |
-| 2305303682471783379 | | 10597 |
-| 7982623143712728547 | | 6669 |
-| 7280399273658728997 | | 6408 |
-| 1090981537032625727 | | 6196 |
-| 5730251990344211405 | | 6019 |
-| 6018350421959114808 | | 5990 |
-| 835157184735512989 | | 5209 |
-| 770542365400669095 | | 4906 |
-+---------------------+--------------+-----------------+
-</code></pre></div></div>
-
-<p>The ClickBench dataset contains:</p>
-
-<ul>
- <li>99,997,497 total rows<sup id="fnref:1" role="doc-noteref"><a
href="#fn:1" class="footnote" rel="footnote">1</a></sup></li>
- <li>17,630,976 different users (distinct UserIDs)<sup id="fnref:2"
role="doc-noteref"><a href="#fn:2" class="footnote"
rel="footnote">2</a></sup></li>
- <li>6,019,103 different search phrases<sup id="fnref:3"
role="doc-noteref"><a href="#fn:3" class="footnote"
rel="footnote">3</a></sup></li>
- <li>24,070,560 distinct combinations<sup id="fnref:4" role="doc-noteref"><a
href="#fn:4" class="footnote" rel="footnote">4</a></sup> of (UserID,
SearchPhrase)</li>
-</ul>
-
-<p>Thus, to answer the query, DataFusion must map each of the 100M
input rows into one of the <strong>24 million different groups</strong>, and
keep count of how many rows there are in each group.</p>
-
-<h2 id="the-solution">The solution</h2>
-
-<p>Like most concepts in databases and other analytic systems, the basic ideas
of this algorithm are straightforward and taught in introductory computer
science courses. You could compute the query with a program such as this<sup
id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote"
rel="footnote">5</a></sup>:</p>
-
-<div class="language-python highlighter-rouge"><div class="highlight"><pre
class="highlight"><code><span class="kn">import</span> <span
class="n">pandas</span> <span class="k">as</span> <span class="n">pd</span>
-<span class="kn">from</span> <span class="n">collections</span> <span
class="kn">import</span> <span class="n">defaultdict</span>
-<span class="kn">from</span> <span class="n">operator</span> <span
class="kn">import</span> <span class="n">itemgetter</span>
-
-<span class="c1"># read file
-</span><span class="n">hits</span> <span class="o">=</span> <span
class="n">pd</span><span class="p">.</span><span
class="nf">read_parquet</span><span class="p">(</span><span
class="sh">'</span><span class="s">hits.parquet</span><span
class="sh">'</span><span class="p">,</span> <span class="n">engine</span><span
class="o">=</span><span class="sh">'</span><span class="s">pyarrow</span><span
class="sh">'</span><span class="p">)</span>
-
-<span class="c1"># build groups
-</span><span class="n">counts</span> <span class="o">=</span> <span
class="nf">defaultdict</span><span class="p">(</span><span
class="nb">int</span><span class="p">)</span>
-<span class="k">for</span> <span class="n">index</span><span
class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span
class="n">hits</span><span class="p">.</span><span
class="nf">iterrows</span><span class="p">():</span>
- <span class="n">group</span> <span class="o">=</span> <span
class="p">(</span><span class="n">row</span><span class="p">[</span><span
class="sh">'</span><span class="s">UserID</span><span class="sh">'</span><span
class="p">],</span> <span class="n">row</span><span class="p">[</span><span
class="sh">'</span><span class="s">SearchPhrase</span><span
class="sh">'</span><span class="p">]);</span>
- <span class="c1"># update the dict entry for the corresponding key
-</span> <span class="n">counts</span><span class="p">[</span><span
class="n">group</span><span class="p">]</span> <span class="o">+=</span> <span
class="mi">1</span>
-
-<span class="c1"># Print the top 10 values
-</span><span class="nf">print </span><span class="p">(</span><span
class="nf">dict</span><span class="p">(</span><span
class="nf">sorted</span><span class="p">(</span><span
class="n">counts</span><span class="p">.</span><span
class="nf">items</span><span class="p">(),</span> <span
class="n">key</span><span class="o">=</span><span
class="nf">itemgetter</span><span class="p">(</span><span
class="mi">1</span><span class="p">),</span> <span
class="n">reverse</span><span class="o">=</span><sp [...]
-</code></pre></div></div>
-
-<p>This approach, while simple, is both slow and very memory inefficient. It
requires over 40 seconds to compute the results for less than 1% of the
dataset<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote"
rel="footnote">6</a></sup>. Both DataFusion <code class="language-plaintext
highlighter-rouge">28.0.0</code> and DuckDB <code class="language-plaintext
highlighter-rouge">0.8.1</code> compute results in under 10 seconds for the
<em>entire</em> dataset.</p>
-
-<p>To answer this query quickly and efficiently, you have to write your code
such that it:</p>
-
-<ol>
- <li>Keeps all cores busy aggregating via parallelized computation</li>
- <li>Updates aggregate values quickly, using vectorizable loops that are easy
for compilers to translate into the high performance <a
href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a>
instructions available in modern CPUs.</li>
-</ol>
-
-<p>The rest of this article explains how grouping works in DataFusion and the
improvements we made in <code class="language-plaintext
highlighter-rouge">28.0.0</code>.</p>
-
-<h3 id="two-phase-parallel-partitioned-grouping">Two phase parallel
partitioned grouping</h3>
-
-<p>Both DataFusion <code class="language-plaintext
highlighter-rouge">27.0.0</code> and <code class="language-plaintext
highlighter-rouge">28.0.0</code> use state-of-the-art, two phase parallel hash
partitioned grouping, similar to other high-performance vectorized engines like
<a href="https://duckdb.org/2022/03/07/aggregate-hashtable.html">DuckDB’s
Parallel Grouped Aggregates</a>. In pictures this looks like:</p>
-
-<div class="language-text highlighter-rouge"><div class="highlight"><pre
class="highlight"><code> ▲ ▲
- │ │
- │ │
- │ │
-┌───────────────────────┐ ┌───────────────────┐
-│ GroupBy │ │ GroupBy │ Step 4
-│ (Final) │ │ (Final) │
-└───────────────────────┘ └───────────────────┘
- ▲ ▲
- │ │
- └────────────┬───────────┘
- │
- │
- ┌─────────────────────────┐
- │ Repartition │ Step 3
- │ HASH(x) │
- └─────────────────────────┘
- ▲
- │
- ┌────────────┴──────────┐
- │ │
- │ │
- ┌────────────────────┐ ┌─────────────────────┐
- │ GroupBy  │ │ GroupBy │ Step 2
- │ (Partial) │ │ (Partial) │
- └────────────────────┘ └─────────────────────┘
- ▲ ▲
- ┌──┘ └─┐
- │ │
- .─────────. .─────────.
- ,─' '─. ,─' '─.
-; Input : ; Input : Step 1
-: Stream 1 ; : Stream 2 ;
- ╲ ╱ ╲ ╱
- '─. ,─' '─. ,─'
- `───────' `───────'
-</code></pre></div></div>
-
-<p><strong>Figure 2</strong>: Two phase repartitioned grouping: data flows
from bottom (source) to top (results) in two phases. First (Steps 1 and 2),
each core reads the data into a core-specific hash table, computing
intermediate aggregates without any cross-core coordination. Then (Steps 3 and
4) DataFusion divides the data (“repartitions”) into distinct subsets by group
value, and each subset is sent to a specific core which computes the final
aggregate.</p>
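<p>The four steps in Figure 2 can be sketched in a few lines of Python. This is a single-threaded illustration only, not DataFusion's implementation: the function names (<code>partial_group_by</code>, <code>repartition</code>, <code>final_group_by</code>, <code>two_phase_count</code>) are hypothetical, the only aggregate is a count, and each "core" is simply a Python list of group keys.</p>

```python
from collections import Counter


def partial_group_by(stream):
    """Step 2: one core counts its own input with no cross-core coordination."""
    return Counter(stream)


def repartition(partials, num_partitions):
    """Step 3: route every (group, partial count) pair by hash(group)."""
    partitions = [[] for _ in range(num_partitions)]
    for partial in partials:
        for group, count in partial.items():
            partitions[hash(group) % num_partitions].append((group, count))
    return partitions


def final_group_by(partition):
    """Step 4: merge partial counts; a group appears in only one partition."""
    totals = Counter()
    for group, count in partition:
        totals[group] += count
    return totals


def two_phase_count(streams, num_partitions=2):
    """Run both phases over a list of input streams (Step 1)."""
    partials = [partial_group_by(stream) for stream in streams]
    return [final_group_by(p) for p in repartition(partials, num_partitions)]


if __name__ == "__main__":
    # Two made-up input streams, one per "core".
    core_a = [2, 1, 3, 4, 1, 4]
    core_b = [2, 2, 3, 1, 4, 3]
    for i, totals in enumerate(two_phase_count([core_a, core_b])):
        print(f"partition {i}: {dict(totals)}")
```

<p>Because the partition is chosen from the hash of the group key alone, all partial counts for one group land in the same final partition, which is why the final phase needs no coordination between partitions.</p>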
-
-<p>The two phases are critical for keeping cores busy in a multi-core system.
Both phases use the same hash table approach (explained in the next section),
but differ in how the groups are distributed and the partial results emitted
from the accumulators. The first phase aggregates data as soon as possible
after it is produced. However, as shown in Figure 2, the groups can be anywhere
in any input, so the same group is often found on many different cores. The
second phase uses a hash fun [...]
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre
class="highlight"><code> ┌─────┐ ┌─────┐
- │ 1 │ │ 3 │
- │ 2 │ │ 4 │ 2. After Repartitioning: each
- └─────┘ └─────┘ group key appears in exactly
- ┌─────┐ ┌─────┐ one partition
- │ 1 │ │ 3 │
- │ 2 │ │ 4 │
- └─────┘ └─────┘
-
-─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
-
- ┌─────┐ ┌─────┐
- │ 2 │ │ 2 │
- │ 1 │ │ 2 │
- │ 3 │ │ 3 │
- │ 4 │ │ 1 │
- └─────┘ └─────┘ 1. Input Stream: groups
- ... ... values are spread
- ┌─────┐ ┌─────┐ arbitrarily over each input
- │ 1 │ │ 4 │
- │ 4 │ │ 3 │
- │ 1 │ │ 1 │
- │ 4 │ │ 3 │
- │ 3 │ │ 2 │
- │ 2 │ │ 2 │
- │ 2 │ └─────┘
- └─────┘
-
- Core A Core B
-
-</code></pre></div></div>
-
-<p><strong>Figure 3</strong>: Group value distribution across 2 cores during
aggregation phases. In the first phase, every group value <code
class="language-plaintext highlighter-rouge">1</code>, <code
class="language-plaintext highlighter-rouge">2</code>, <code
class="language-plaintext highlighter-rouge">3</code>, <code
class="language-plaintext highlighter-rouge">4</code>, is present in the input
stream processed by each core. In the second phase, after repartitioning, the
group value [...]
-
-<p>There are some additional subtleties in the <a
href="https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_plan/aggregates/row_hash.rs">DataFusion
implementation</a> not mentioned above due to space constraints, such as:</p>
-
-<ol>
- <li>The policy of when to emit data from the first phase’s hash table (e.g.
because the data is partially sorted)</li>
- <li>Handling specific filters per aggregate (due to the <code
class="language-plaintext highlighter-rouge">FILTER</code> SQL clause)</li>
- <li>Data types of intermediate values (which may not be the same as the
final output for some aggregates such as <code class="language-plaintext
highlighter-rouge">AVG</code>).</li>
- <li>Action taken when memory use exceeds its budget.</li>
-</ol>
-
-<h3 id="hash-grouping">Hash grouping</h3>
-
-<p>DataFusion queries can compute many different aggregate functions for each
group, both <a
href="https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html">built
in</a> and user-defined <a
href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.AggregateUDF.html"><code
class="language-plaintext highlighter-rouge">AggregateUDFs</code></a>. The
state for each aggregate function, called an <em>accumulator</em>, is tracked
with a hash table (DataFusion u [...]
-
-<h3 id="hash-grouping-in-2700">Hash grouping in <code
class="language-plaintext highlighter-rouge">27.0.0</code></h3>
-
-<p>As shown in Figure 4, DataFusion <code class="language-plaintext
highlighter-rouge">27.0.0</code> stores the data in a <a
href="https://github.com/apache/arrow-datafusion/blob/4d93b6a3802151865b68967bdc4c7d7ef425b49a/datafusion/core/src/physical_plan/aggregates/utils.rs#L38-L50"><code
class="language-plaintext highlighter-rouge">GroupState</code></a> structure
which, unsurprisingly, tracks the state for each group. The state for each
group consists of:</p>
-
-<ol>
- <li>The actual value of the group columns, in <a
href="https://docs.rs/arrow-row/latest/arrow_row/index.html">Arrow Row</a>
format.</li>
- <li>In-progress accumulations (e.g. the running counts for the <code
class="language-plaintext highlighter-rouge">COUNT</code> aggregate) for each
group, in one of two possible formats (<a
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/expr/src/accumulator.rs#L24-L49"><code
class="language-plaintext highlighter-rouge">Accumulator</code></a> or <a
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b42
[...]
- <li>Scratch space for tracking which rows match each aggregate in each
batch.</li>
-</ol>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>
┌──────────────────────────────────────┐
- │ │
- │ ... │
- │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
- │ ┃ ┃ │
- ┌─────────┐ │ ┃ ┌──────────────────────────────┐ ┃ │
- │ │ │ ┃ │group values: OwnedRow │ ┃ │
- │ ┌─────┐ │ │ ┃ └──────────────────────────────┘ ┃ │
- │ │ 5 │ │ │ ┃ ┌──────────────────────────────┐ ┃ │
- │ ├─────┤ │ │ ┃ │Row accumulator: │ ┃ │
- │ │ 9 │─┼────┐ │ ┃ │Vec<u8> │ ┃ │
- │ ├─────┤ │ │ │ ┃ └──────────────────────────────┘ ┃ │
- │ │ ... │ │ │ │ ┃ ┌──────────────────────┐ ┃ │
- │ ├─────┤ │ │ │ ┃ │┌──────────────┐ │ ┃ │
- │ │ 1 │ │ │ │ ┃ ││Accumulator 1 │ │ ┃ │
- │ ├─────┤ │ │ │ ┃ │└──────────────┘ │ ┃ │
- │ │ ... │ │ │ │ ┃ │┌──────────────┐ │ ┃ │
- │ └─────┘ │ │ │ ┃ ││Accumulator 2 │ │ ┃ │
- │ │ │ │ ┃ │└──────────────┘ │ ┃ │
- └─────────┘ │ │ ┃ │ Box<dyn Accumulator> │ ┃ │
- Hash Table │ │ ┃ └──────────────────────┘ ┃ │
- │ │ ┃ ┌─────────────────────────┐ ┃ │
- │ │ ┃ │scratch indices: Vec<u32>│ ┃ │
- │ │ ┃ └─────────────────────────┘ ┃ │
- │ │ ┃ GroupState ┃ │
- └─────▶ │ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ │
- │ │
- Hash table tracks an │ ... │
- index into group_states │ │
- └──────────────────────────────────────┘
- group_states: Vec<GroupState>
-
- There is one GroupState PER GROUP
-
-</code></pre></div></div>
-
-<p><strong>Figure 4</strong>: Hash group operator structure in DataFusion
<code class="language-plaintext highlighter-rouge">27.0.0</code>. A hash table
maps each group to a GroupState which contains all the per-group states.</p>
-
-<p>To compute the aggregate, DataFusion performs the following steps for each
input batch:</p>
-
-<ol>
- <li>Calculate hash using <a
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/hash_utils.rs#L264-L307">efficient
vectorized code</a>, specialized for each data type.</li>
- <li>Determine group indexes for each input row using the hash table
(creating new entries for newly seen groups).</li>
- <li><a
href="https://github.com/apache/arrow-datafusion/blob/4ab8be57dee3bfa72dd105fbd7b8901b873a4878/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L562-L602">Update
Accumulators for each group that had input rows,</a> assembling the rows into
a contiguous range for the vectorized accumulator update if there are enough
of them.</li>
-</ol>
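
The per-batch steps above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not DataFusion's code: it uses the standard library HashMap in place of RawTable (so hashing happens per row inside the map rather than in a separate vectorized pass), plain byte vectors in place of the Arrow Row format, and a single COUNT-style counter in place of the accumulators. The names update_batch, group_value, count, and scratch_indices are illustrative only.

```rust
use std::collections::HashMap;

/// Simplified per-group state; every group owns its own heap allocations.
struct GroupState {
    /// Stand-in for the group values in row format (one allocation per group).
    group_value: Vec<u8>,
    /// Stand-in for the in-progress accumulation (a running COUNT).
    count: u64,
    /// Rows of the current batch that belong to this group.
    scratch_indices: Vec<u32>,
}

/// Processes one input batch of group keys, updating the state in place.
fn update_batch(
    table: &mut HashMap<Vec<u8>, usize>,
    group_states: &mut Vec<GroupState>,
    batch: &[&[u8]],
) {
    // Groups that received at least one row from this batch.
    let mut touched: Vec<usize> = Vec::new();

    // Determine the group index for each input row, creating new entries
    // for groups that have not been seen before.
    for (row, key) in batch.iter().enumerate() {
        let found = table.get(*key).copied();
        let idx = match found {
            Some(idx) => idx,
            None => {
                let idx = group_states.len();
                group_states.push(GroupState {
                    group_value: key.to_vec(),
                    count: 0,
                    scratch_indices: Vec::new(),
                });
                table.insert(key.to_vec(), idx);
                idx
            }
        };
        let state = &mut group_states[idx];
        if state.scratch_indices.is_empty() {
            touched.push(idx);
        }
        state.scratch_indices.push(row as u32);
    }

    // Update the accumulator of each group that had input rows, then reset
    // the scratch space for the next batch.
    for idx in touched {
        let state = &mut group_states[idx];
        state.count += state.scratch_indices.len() as u64;
        state.scratch_indices.clear();
    }
}
```

Even in this reduced form, each new group costs several separate allocations (the key, the scratch vector, and the table entry), which is the overhead the high cardinality discussion below refers to.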
-
-<p>DataFusion also stores the hash values in the table to avoid potentially
costly hash recomputation when resizing the hash table.</p>
-
-<p>This scheme works very well for a relatively small number of distinct
groups: all accumulators are efficiently updated with large contiguous batches
of rows.</p>
-
-<p>However, this scheme is not ideal for high cardinality grouping due to:</p>
-
-<ol>
- <li><strong>Multiple allocations per group</strong> for the group value row
format, as well as for the <code class="language-plaintext
highlighter-rouge">RowAccumulator</code>s and each <code
class="language-plaintext highlighter-rouge">Accumulator</code>. The <code
class="language-plaintext highlighter-rouge">Accumulator</code> may have
additional allocations within it as well.</li>
- <li><strong>Non-vectorized updates:</strong> Accumulator updates often fall
back to a slower non-vectorized form because the number of distinct groups is
large (and thus the number of values per group is small) in each input batch.</li>
-</ol>
-
-<h3 id="hash-grouping-in-2800">Hash grouping in <code
class="language-plaintext highlighter-rouge">28.0.0</code></h3>
-
-<p>For <code class="language-plaintext highlighter-rouge">28.0.0</code>, we
rewrote the core group by implementation following traditional system
optimization principles: fewer allocations, type specialization, and aggressive
vectorization.</p>
-
-<p>DataFusion <code class="language-plaintext highlighter-rouge">28.0.0</code>
uses the same RawTable and still stores group indexes. The major differences,
as shown in Figure 5, are:</p>
-
-<ol>
- <li>Group values are stored either
- <ol>
- <li>Inline in the <code class="language-plaintext
highlighter-rouge">RawTable</code> (for single columns of primitive types),
where the conversion to Row format costs more than its benefit</li>
- <li>In a separate <a
href="https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html">Rows</a>
structure with a single contiguous allocation for all group values, rather
than an allocation per group. Accumulators manage the state for all the groups
internally, so the code to update intermediate values is a tight type
specialized loop. The new <a
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/gr
[...]
- </ol>
- </li>
-</ol>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre
class="highlight"><code>┌───────────────────────────────────┐
┌───────────────────────┐
-│ ┌ ─ ─ ─ ─ ─ ┐ ┌─────────────────┐│ │ ┏━━━━━━━━━━━━━━━━━━━┓ │
-│ │ ││ │ ┃ ┌──────────────┐ ┃ │
-│ │ │ │ ┌ ─ ─ ┐┌─────┐ ││ │ ┃ │┌───────────┐ │ ┃ │
-│ │ X │ 5 │ ││ │ ┃ ││ value1 │ │ ┃ │
-│ │ │ │ ├ ─ ─ ┤├─────┤ ││ │ ┃ │└───────────┘ │ ┃ │
-│ │ Q │ 9 │──┼┼──┐ │ ┃ │ ... │ ┃ │
-│ │ │ │ ├ ─ ─ ┤├─────┤ ││ └──┼─╋─▶│ │ ┃ │
-│ │ ... │ ... │ ││ │ ┃ │┌───────────┐ │ ┃ │
-│ │ │ │ ├ ─ ─ ┤├─────┤ ││ │ ┃ ││ valueN │ │ ┃ │
-│ │ H │ 1 │ ││ │ ┃ │└───────────┘ │ ┃ │
-│ │ │ │ ├ ─ ─ ┤├─────┤ ││ │ ┃ │values: Vec<T>│ ┃ │
-│ Rows │ ... │ ... │ ││ │ ┃ └──────────────┘ ┃ │
-│ │ │ │ └ ─ ─ ┘└─────┘ ││ │ ┃ ┃ │
-│ ─ ─ ─ ─ ─ ─ │ ││ │ ┃ GroupsAccumulator ┃ │
-│ └─────────────────┘│ │ ┗━━━━━━━━━━━━━━━━━━━┛ │
-│ Hash Table │ │ │
-│ │ │ ... │
-└───────────────────────────────────┘ └───────────────────────┘
- GroupState Accumulators
-
-
-Hash table value stores group_indexes One GroupsAccumulator
-and group values. per aggregate. Each
- stores the state for
-Group values are stored either inline *ALL* groups, typically
-in the hash table or in a single using a native Vec<T>
-allocation using the arrow Row format
-</code></pre></div></div>
-
-<p><strong>Figure 5</strong>: Hash group operator structure in DataFusion
<code class="language-plaintext highlighter-rouge">28.0.0</code>. Group values
are stored either directly in the hash table, or in a single allocation using
the arrow Row format. The hash table contains group indexes. A single <code
class="language-plaintext highlighter-rouge">GroupsAccumulator</code> stores
the per-aggregate state for <em>all</em> groups.</p>
-
-<p>This new structure improves performance significantly for high cardinality
groups due to:</p>
-
-<ol>
- <li><strong>Reduced allocations</strong>: There are no longer any individual
allocations per group.</li>
- <li><strong>Contiguous native accumulator states</strong>: Type-specialized
accumulators store the values for all groups in a single contiguous allocation
using a <a href="https://doc.rust-lang.org/std/vec/struct.Vec.html">Rust
Vec<T></a> of some native type.</li>
- <li><strong>Vectorized state update</strong>: The inner aggregate update
loops, which are type-specialized and operate on native <code
class="language-plaintext highlighter-rouge">Vec</code>s, are well-vectorized
by the Rust compiler (thanks <a href="https://llvm.org/">LLVM</a>!).</li>
-</ol>
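
A minimal sketch of the idea, assuming a SUM over 64-bit integers: one accumulator holds the state for all groups in a single Vec, and each input row is routed to its slot by a group index that the hash table has already computed. The type SumGroupsAccumulator and its method are hypothetical and much simpler than DataFusion's GroupsAccumulator (no Arrow arrays, no filter, no null handling).

```rust
/// Hypothetical accumulator that keeps the running SUM for *all* groups
/// in one contiguous allocation, indexed by group index.
struct SumGroupsAccumulator {
    sums: Vec<i64>,
}

impl SumGroupsAccumulator {
    fn new() -> Self {
        Self { sums: Vec::new() }
    }

    /// Adds each value to the slot of its group.
    ///
    /// `group_indices[i]` is the group that `values[i]` belongs to, and
    /// `total_num_groups` is the number of groups seen so far.
    fn update_batch(&mut self, values: &[i64], group_indices: &[usize], total_num_groups: usize) {
        assert_eq!(values.len(), group_indices.len());

        // Grow the state once per batch rather than allocating per group.
        if self.sums.len() < total_num_groups {
            self.sums.resize(total_num_groups, 0);
        }

        // Tight, type-specialized inner loop over native values.
        for (&value, &group_index) in values.iter().zip(group_indices) {
            self.sums[group_index] += value;
        }
    }
}
```

With this layout, adding a new group costs at most an occasional resize of the shared Vec, and the inner loop touches only native integers.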
-
-<h3 id="notes">Notes</h3>
-
-<p>Some vectorized grouping implementations store the accumulator state
row-wise directly in the hash table, which often uses modern CPU caches
efficiently. Managing accumulator state in columnar fashion may sacrifice some
cache locality; however, it keeps the hash table small even when there are
large numbers of groups and aggregates, and it makes the accumulator update
easier for the compiler to vectorize.</p>
-
-<p>Depending on the cost of recomputing hash values, DataFusion <code
class="language-plaintext highlighter-rouge">28.0.0</code> may or may not store
the hash values in the table. This optimizes the tradeoff between the cost of
computing the hash value (which is expensive for strings, for example) vs. the
cost of storing it in the hash table.</p>
-
-<p>One subtlety that arises from pushing state updates into GroupsAccumulators
is that each accumulator must handle similar variations with/without filtering
and with/without nulls in the input. DataFusion <code class="language-plaintext
highlighter-rouge">28.0.0</code> uses a templated <a
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/groups_accumulator/accumulate.rs#L28-L54"><code
class="language-pla [...]
-
-<p>The code structure is heavily influenced by the fact that DataFusion is
implemented using <a href="https://www.rust-lang.org/">Rust</a>, a new(ish)
systems programming language focused on speed and safety. Rust heavily
discourages many of the traditional pointer casting “tricks” used in C/C++ hash
grouping implementations. The DataFusion aggregation code is almost entirely <a
href="https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html#:~:text=Safe%20Rust%20is%20the%20true,Undefined%2
[...]
-
-<h2 id="clickbench-results">ClickBench results</h2>
-
-<p>The full results of running the <a
href="https://github.com/ClickHouse/ClickBench/tree/main">ClickBench</a>
queries against the single Parquet file with DataFusion <code
class="language-plaintext highlighter-rouge">27.0.0</code>, DataFusion <code
class="language-plaintext highlighter-rouge">28.0.0</code>, and DuckDB <code
class="language-plaintext highlighter-rouge">0.8.1</code> are below. These
numbers were run on a GCP <code class="language-plaintext
highlighter-rouge">e2-standard-8 [...]
-
-<p>As the industry moves towards data systems assembled from components, it is
increasingly important that they exchange data using open standards such as <a
href="https://arrow.apache.org/">Apache Arrow</a> and <a
href="https://parquet.apache.org/">Parquet</a> rather than custom storage and
in-memory formats. Thus, this benchmark uses a single input Parquet file
representative of many DataFusion users and aligned with the current trend in
analytics of avoiding a costly load/transformati [...]
-
-<p>DataFusion now reaches near-DuckDB-speeds querying Parquet data. While we
don’t plan to engage in a benchmarking shootout with a team that literally
wrote <a href="https://dl.acm.org/doi/abs/10.1145/3209950.3209955">Fair
Benchmarking Considered Difficult</a>, hopefully everyone can agree that
DataFusion <code class="language-plaintext highlighter-rouge">28.0.0</code> is
a significant improvement.</p>
-
-<p><img src="/blog/assets/datafusion_fast_grouping/full.png" width="700" /></p>
-
-<p><strong>Figure 6</strong>: Performance of DataFusion <code
class="language-plaintext highlighter-rouge">27.0.0</code>, DataFusion <code
class="language-plaintext highlighter-rouge">28.0.0</code>, and DuckDB <code
class="language-plaintext highlighter-rouge">0.8.1</code> on all 43 ClickBench
queries against a single <code class="language-plaintext
highlighter-rouge">hits.parquet</code> file. Lower is better.</p>
-
-<h3 id="notes-1">Notes</h3>
-
-<p>DataFusion <code class="language-plaintext highlighter-rouge">27.0.0</code>
was not able to run several queries due to either planner bugs (Q9, Q11, Q12,
Q14) or running out of memory (Q33). DataFusion <code class="language-plaintext
highlighter-rouge">28.0.0</code> solves those issues.</p>
-
-<p>DataFusion is faster than DuckDB for queries 21 and 22, likely due to
optimized implementations of string pattern matching.</p>
-
-<h2 id="conclusion-performance-matters">Conclusion: performance matters</h2>
-
-<p>Improving aggregation performance by more than a factor of two allows
developers building products and projects with DataFusion to spend more time on
value-added domain specific features. We believe building systems with
DataFusion is much faster than trying to build something similar from scratch.
DataFusion increases productivity because it eliminates the need to rebuild
well-understood, but costly to implement, analytic database technology. While
we’re pleased with the improvements [...]
-
-<h2 id="acknowledgments">Acknowledgments</h2>
-
-<p>DataFusion is a <a
href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">community
effort</a> and this work would not have been possible without contributions from many in
the community. A special shout out to <a
href="https://github.com/sunchao">sunchao</a>, <a
href="https://github.com/jyshen">yjshen</a>, <a
href="https://github.com/yahoNanJing">yahoNanJing</a>, <a
href="https://github.com/mingmwang">mingmwang</a>, <a
href="https://github.com/ozankabak">ozankabak</a>, < [...]
-
-<h2 id="about-datafusion">About DataFusion</h2>
-
-<p><a href="https://arrow.apache.org/datafusion/">Apache Arrow DataFusion</a>
is an extensible query engine and database toolkit, written in <a
href="https://www.rust-lang.org/">Rust</a>, that uses <a
href="https://arrow.apache.org/">Apache Arrow</a> as its in-memory format.
DataFusion, along with <a href="https://calcite.apache.org/">Apache
Calcite</a>, Facebook’s <a
href="https://github.com/facebookincubator/velox">Velox</a>, and similar
technology are part of the next generation “<a h [...]
-
-<!-- Footnotes themselves at the bottom. -->
-<h2 id="notes-2">Notes</h2>
-
-<div class="footnotes" role="doc-endnotes">
- <ol>
- <li id="fn:1" role="doc-endnote">
- <p><code class="language-plaintext highlighter-rouge">SELECT COUNT(*)
FROM 'hits.parquet';</code> <a href="#fnref:1" class="reversefootnote"
role="doc-backlink">↩</a></p>
- </li>
- <li id="fn:2" role="doc-endnote">
- <p><code class="language-plaintext highlighter-rouge">SELECT
COUNT(DISTINCT "UserID") as num_users FROM 'hits.parquet';</code> <a
href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
- </li>
- <li id="fn:3" role="doc-endnote">
- <p><code class="language-plaintext highlighter-rouge">SELECT
COUNT(DISTINCT "SearchPhrase") as num_phrases FROM 'hits.parquet';</code> <a
href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
- </li>
- <li id="fn:4" role="doc-endnote">
- <p><code class="language-plaintext highlighter-rouge">SELECT COUNT(*)
FROM (SELECT DISTINCT "UserID", "SearchPhrase" FROM 'hits.parquet')</code> <a
href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
- </li>
- <li id="fn:5" role="doc-endnote">
- <p>Full script at <a
href="https://github.com/alamb/datafusion-duckdb-benchmark/blob/main/hash.py">hash.py</a>
<a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
- </li>
- <li id="fn:6" role="doc-endnote">
- <p><a
href="https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_%7B%7D.parquet">hits_0.parquet</a>,
one of the files from the partitioned ClickBench dataset, which has <code
class="language-plaintext highlighter-rouge">100,000</code> rows and is 117 MB
in size. The entire dataset has <code class="language-plaintext
highlighter-rouge">100,000,000</code> rows in a single 14 GB Parquet file. The
script did not complete on the entire dataset after 40 minutes, and us [...]
- </li>
- </ol>
-</div>]]></content><author><name>alamb, Dandandan,
tustvold</name></author><category term="release" /><summary
type="html"><![CDATA[<!–]]></summary></entry></feed>
\ No newline at end of file
+the methods listed in our <a
href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">Communication
Doc</a>.</p>]]></content><author><name>pmc</name></author><category
term="release" /><summary
type="html"><![CDATA[<!–]]></summary></entry></feed>
\ No newline at end of file
diff --git a/index.html b/index.html
index 7b3aeee..0d1617e 100644
--- a/index.html
+++ b/index.html
@@ -38,7 +38,12 @@
<div class="wrapper">
<div class="home">
<h2 class="post-list-heading">Posts</h2>
- <ul class="post-list"><li><span class="post-meta">Sep 13, 2024</span>
+ <ul class="post-list"><li><span class="post-meta">Sep 27, 2024</span>
+ <h3>
+ <a class="post-link" href="/blog/2024/09/27/datafusion-comet-0.3.0/">
+ Apache DataFusion Comet 0.3.0 Release
+ </a>
+ </h3></li><li><span class="post-meta">Sep 13, 2024</span>
<h3>
<a class="post-link"
href="/blog/2024/09/13/string-view-german-style-strings-part-2/">
Using StringView / German Style Strings to make Queries Faster:
Part 2 - String Operations