(datafusion-site) branch asf-site updated: Comet 0.3.0 blog post (#30)

agrove Mon, 07 Oct 2024 20:19:14 -0700

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 0466888  Comet 0.3.0 blog post (#30)
0466888 is described below

commit 0466888c1ecae6edaacabd5f6424390f9beab179
Author: Andy Grove <[email protected]>
AuthorDate: Mon Oct 7 20:37:06 2024 -0600

    Comet 0.3.0 blog post (#30)
---
 2024/09/27/datafusion-comet-0.3.0/index.html | 152 +++++++++
 feed.xml                                     | 449 +++++----------------------
 index.html                                   |   7 +-
 3 files changed, 228 insertions(+), 380 deletions(-)

diff --git a/2024/09/27/datafusion-comet-0.3.0/index.html b/2024/09/27/datafusion-comet-0.3.0/index.html
new file mode 100644
index 0000000..953011f
--- /dev/null
+++ b/2024/09/27/datafusion-comet-0.3.0/index.html
@@ -0,0 +1,152 @@
+<!DOCTYPE html>
+<html lang="en"><head>
+  <meta charset="utf-8">
+  <meta http-equiv="X-UA-Compatible" content="IE=edge">
+  <meta name="viewport" content="width=device-width, initial-scale=1"><!-- Begin Jekyll SEO tag v2.8.0 -->
+<title>Apache DataFusion Comet 0.3.0 Release | Apache DataFusion Project News &amp; Blog</title>
+<meta name="generator" content="Jekyll v4.3.3" />
+<meta property="og:title" content="Apache DataFusion Comet 0.3.0 Release" />
+<meta name="author" content="pmc" />
+<meta property="og:locale" content="en_US" />
+<meta name="description" content="&lt;!–" />
+<meta property="og:description" content="&lt;!–" />
+<link rel="canonical" href="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/" />
+<meta property="og:url" content="https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/" />
+<meta property="og:site_name" content="Apache DataFusion Project News &amp; Blog" />
+<meta property="og:type" content="article" />
+<meta property="article:published_time" content="2024-09-27T00:00:00+00:00" />
+<meta name="twitter:card" content="summary" />
+<meta property="twitter:title" content="Apache DataFusion Comet 0.3.0 Release" />
+<script type="application/ld+json">
+{"@context":"https://schema.org","@type":"BlogPosting","author":{"@type":"Person","name":"pmc"},"dateModified":"2024-09-27T00:00:00+00:00","datePublished":"2024-09-27T00:00:00+00:00","description":"&lt;!–","headline":"Apache
 DataFusion Comet 0.3.0 
Release","mainEntityOfPage":{"@type":"WebPage","@id":"https://datafusion.apache.org/blog/2024/09/27/datafusion-comet-0.3.0/"},"publisher":{"@type":"Organization","logo":{"@type":"ImageObject","url":"https://datafusion.apache.org/blog/img/2x_bgw
 [...]
+<!-- End Jekyll SEO tag -->
+<link rel="stylesheet" href="/blog/assets/main.css"><link type="application/atom+xml" rel="alternate" href="https://datafusion.apache.org/blog/feed.xml" title="Apache DataFusion Project News &amp; Blog" /></head>
+<body><header class="site-header" role="banner">
+
+  <div class="wrapper"><a class="site-title" rel="author" href="/blog/">Apache DataFusion Project News &amp; Blog</a><nav class="site-nav">
+        <input type="checkbox" id="nav-trigger" class="nav-trigger" />
+        <label for="nav-trigger">
+          <span class="menu-icon">
+            <svg viewBox="0 0 18 15" width="18px" height="15px">
+              <path 
d="M18,1.484c0,0.82-0.665,1.484-1.484,1.484H1.484C0.665,2.969,0,2.304,0,1.484l0,0C0,0.665,0.665,0,1.484,0
 h15.032C17.335,0,18,0.665,18,1.484L18,1.484z 
M18,7.516C18,8.335,17.335,9,16.516,9H1.484C0.665,9,0,8.335,0,7.516l0,0 
c0-0.82,0.665-1.484,1.484-1.484h15.032C17.335,6.031,18,6.696,18,7.516L18,7.516z 
M18,13.516C18,14.335,17.335,15,16.516,15H1.484 
C0.665,15,0,14.335,0,13.516l0,0c0-0.82,0.665-1.483,1.484-1.483h15.032C17.335,12.031,18,12.695,18,13.516L18,13.516z"/>
+            </svg>
+          </span>
+        </label>
+
+        <div class="trigger"><a class="page-link" href="/blog/about/">About</a></div>
+      </nav></div>
+</header>
+<main class="page-content" aria-label="Content">
+      <div class="wrapper">
+        <article class="post h-entry" itemscope itemtype="http://schema.org/BlogPosting">
+
+  <header class="post-header">
+    <h1 class="post-title p-name" itemprop="name headline">Apache DataFusion Comet 0.3.0 Release</h1>
+    <p class="post-meta">
+      <time class="dt-published" datetime="2024-09-27T00:00:00+00:00" itemprop="datePublished">Sep 27, 2024
+      </time>• <span itemprop="author" itemscope itemtype="http://schema.org/Person"><span class="p-author h-card" itemprop="name">pmc</span></span></p>
+  </header>
+
+  <div class="post-content e-content" itemprop="articleBody">
+    <!--
+
+-->
+
+<p>The Apache DataFusion PMC is pleased to announce version 0.3.0 of the <a href="https://datafusion.apache.org/comet/">Comet</a> subproject.</p>
+
+<p>Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.</p>
+
+<p>Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
+expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
+to the <a href="https://datafusion.apache.org/comet/user-guide/compatibility.html">compatibility guide</a> for more information.</p>
+
+<p>This release covers approximately four weeks of development work and is the result of merging 57 PRs from 12 
+contributors. See the <a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.3.0.md">change log</a> for more information.</p>
+
+<h2 id="release-highlights">Release Highlights</h2>
+
+<h3 id="binary-releases">Binary Releases</h3>
+
+<p>Comet jar files are now published to Maven central for amd64 and arm64 architectures (Linux only).</p>
+
+<p>Files can be found at https://central.sonatype.com/search?q=org.apache.datafusion</p>
+
+<ul>
+  <li>Spark versions 3.3, 3.4, and 3.5 are supported.</li>
+  <li>Scala versions 2.12 and 2.13 are supported.</li>
+</ul>
+
+<h3 id="new-features">New Features</h3>
+
+<p>The following expressions are now supported natively:</p>
+
+<ul>
+  <li><code class="language-plaintext highlighter-rouge">DateAdd</code></li>
+  <li><code class="language-plaintext highlighter-rouge">DateSub</code></li>
+  <li><code class="language-plaintext highlighter-rouge">ElementAt</code></li>
+  <li><code class="language-plaintext highlighter-rouge">GetArrayElement</code></li>
+  <li><code class="language-plaintext highlighter-rouge">ToJson</code></li>
+</ul>
+
+<h3 id="performance--stability">Performance &amp; Stability</h3>
+
+<ul>
+  <li>Upgraded to DataFusion 42.0.0</li>
+  <li>Reduced memory overhead due to some memory leaks being fixed</li>
+  <li>Comet will now fall back to Spark for queries that use DPP, to avoid performance regressions because Comet does 
+not have native support for DPP yet</li>
+  <li>Improved performance when converting Spark columnar data to Arrow format</li>
+  <li>Faster decimal sum and avg functions</li>
+</ul>
+
+<h3 id="documentation-updates">Documentation Updates</h3>
+
+<ul>
+  <li>Improved documentation for deploying Comet with Kubernetes and Helm in the <a href="https://datafusion.apache.org/comet/user-guide/kubernetes.html">Comet Kubernetes Guide</a></li>
+  <li>More detailed architectural overview of Comet scan and execution in the <a href="https://datafusion.apache.org/comet/contributor-guide/plugin_overview.html">Comet Plugin Overview</a> in the contributor guide</li>
+</ul>
+
+<h2 id="getting-involved">Getting Involved</h2>
+
+<p>The Comet project welcomes new contributors. We use the same <a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord">Slack and Discord</a> channels as the main DataFusion
+project.</p>
+
+<p>The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
+performance regressions that you find. See the <a href="https://datafusion.apache.org/comet/user-guide/installation.html">Getting Started</a> guide for instructions on downloading and installing
+Comet.</p>
+
+<p>There are also many <a href="https://github.com/apache/datafusion-comet/contribute">good first issues</a> waiting for contributions.</p>
+
+
+  </div><a class="u-url" href="/blog/2024/09/27/datafusion-comet-0.3.0/" hidden></a>
+</article>
+
+      </div>
+    </main><footer class="site-footer h-card">
+  <data class="u-url" href="/blog/"></data>
+
+  <div class="wrapper">
+
+    <h2 class="footer-heading">Apache DataFusion Project News &amp; Blog</h2>
+
+    <div class="footer-col-wrapper">
+      <div class="footer-col footer-col-1">
+        <ul class="contact-list">
+          <li class="p-name">Apache DataFusion Project News &amp; Blog</li><li><a class="u-email" href="mailto:[email protected]">[email protected]</a></li></ul>
+      </div>
+
+      <div class="footer-col footer-col-2"><ul class="social-media-list"><li><a href="https://github.com/apache"><svg class="svg-icon"><use xlink:href="/blog/assets/minima-social-icons.svg#github"></use></svg> <span class="username">apache</span></a></li><li><a href="https://www.twitter.com/ApacheDataFusio"><svg class="svg-icon"><use xlink:href="/blog/assets/minima-social-icons.svg#twitter"></use></svg> <span class="username">ApacheDataFusio</span></a></li></ul>
+</div>
+
+      <div class="footer-col footer-col-3">
+        <p>Apache DataFusion is a very fast, extensible query engine for 
building high-quality  data-centric systems in Rust, using the Apache Arrow 
in-memory format.</p>
+      </div>
+    </div>
+
+  </div>
+
+</footer>
+</body>
+
+</html>
diff --git a/feed.xml b/feed.xml
index c542e2a..a48659e 100644
--- a/feed.xml
+++ b/feed.xml
@@ -1,4 +1,72 @@
-<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://datafusion.apache.org/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/" rel="alternate" type="text/html" /><updated>2024-10-01T19:55:17+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title type="html">Apache DataFusion Project News &amp;amp;  [...]
+<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.3">Jekyll</generator><link href="https://datafusion.apache.org/blog/feed.xml" rel="self" type="application/atom+xml" /><link href="https://datafusion.apache.org/blog/" rel="alternate" type="text/html" /><updated>2024-10-08T02:34:20+00:00</updated><id>https://datafusion.apache.org/blog/feed.xml</id><title type="html">Apache DataFusion Project News &amp;amp;  [...]
+
+-->
+
+<p>The Apache DataFusion PMC is pleased to announce version 0.3.0 of the <a href="https://datafusion.apache.org/comet/">Comet</a> subproject.</p>
+
+<p>Comet is an accelerator for Apache Spark that translates Spark physical plans to DataFusion physical plans for
+improved performance and efficiency without requiring any code changes.</p>
+
+<p>Comet runs on commodity hardware and aims to provide 100% compatibility with Apache Spark. Any operators or
+expressions that are not fully compatible will fall back to Spark unless explicitly enabled by the user. Refer
+to the <a href="https://datafusion.apache.org/comet/user-guide/compatibility.html">compatibility guide</a> for more information.</p>
+
+<p>This release covers approximately four weeks of development work and is the result of merging 57 PRs from 12 
+contributors. See the <a href="https://github.com/apache/datafusion-comet/blob/main/dev/changelog/0.3.0.md">change log</a> for more information.</p>
+
+<h2 id="release-highlights">Release Highlights</h2>
+
+<h3 id="binary-releases">Binary Releases</h3>
+
+<p>Comet jar files are now published to Maven central for amd64 and arm64 architectures (Linux only).</p>
+
+<p>Files can be found at https://central.sonatype.com/search?q=org.apache.datafusion</p>
+
+<ul>
+  <li>Spark versions 3.3, 3.4, and 3.5 are supported.</li>
+  <li>Scala versions 2.12 and 2.13 are supported.</li>
+</ul>
+
+<h3 id="new-features">New Features</h3>
+
+<p>The following expressions are now supported natively:</p>
+
+<ul>
+  <li><code class="language-plaintext highlighter-rouge">DateAdd</code></li>
+  <li><code class="language-plaintext highlighter-rouge">DateSub</code></li>
+  <li><code class="language-plaintext highlighter-rouge">ElementAt</code></li>
+  <li><code class="language-plaintext highlighter-rouge">GetArrayElement</code></li>
+  <li><code class="language-plaintext highlighter-rouge">ToJson</code></li>
+</ul>
+
+<h3 id="performance--stability">Performance &amp; Stability</h3>
+
+<ul>
+  <li>Upgraded to DataFusion 42.0.0</li>
+  <li>Reduced memory overhead due to some memory leaks being fixed</li>
+  <li>Comet will now fall back to Spark for queries that use DPP, to avoid performance regressions because Comet does 
+not have native support for DPP yet</li>
+  <li>Improved performance when converting Spark columnar data to Arrow format</li>
+  <li>Faster decimal sum and avg functions</li>
+</ul>
+
+<h3 id="documentation-updates">Documentation Updates</h3>
+
+<ul>
+  <li>Improved documentation for deploying Comet with Kubernetes and Helm in the <a href="https://datafusion.apache.org/comet/user-guide/kubernetes.html">Comet Kubernetes Guide</a></li>
+  <li>More detailed architectural overview of Comet scan and execution in the <a href="https://datafusion.apache.org/comet/contributor-guide/plugin_overview.html">Comet Plugin Overview</a> in the contributor guide</li>
+</ul>
+
+<h2 id="getting-involved">Getting Involved</h2>
+
+<p>The Comet project welcomes new contributors. We use the same <a href="https://datafusion.apache.org/contributor-guide/communication.html#slack-and-discord">Slack and Discord</a> channels as the main DataFusion
+project.</p>
+
+<p>The easiest way to get involved is to test Comet with your current Spark jobs and file issues for any bugs or
+performance regressions that you find. See the <a href="https://datafusion.apache.org/comet/user-guide/installation.html">Getting Started</a> guide for instructions on downloading and installing
+Comet.</p>
+
+<p>There are also many <a href="https://github.com/apache/datafusion-comet/contribute">good first issues</a> waiting for contributions.</p>]]></content><author><name>pmc</name></author><category term="subprojects" /><summary type="html"><![CDATA[&lt;!–]]></summary></entry><entry><title type="html">Using StringView / German Style Strings to Make Queries Faster: Part 1- Reading Parquet</title><link href="https://datafusion.apache.org/blog/2024/09/13/string-view-german-style-strings-part-1/ [...]
 
 -->
 
@@ -1395,381 +1463,4 @@ suitable for beginners is <a href="https://github.com/apache/arrow-datafusion/is
 meetings. Timezones are always a challenge for such meetings, but we hope to
 have two calls that can work for most attendees. If you are interested
 in helping, or just want to say hi, please drop us a note via one of 
-the methods listed in our <a href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">Communication Doc</a>.</p>]]></content><author><name>pmc</name></author><category term="release" /><summary type="html"><![CDATA[&lt;!–]]></summary></entry><entry><title type="html">Aggregating Millions of Groups Fast in Apache Arrow DataFusion 28.0.0</title><link href="https://datafusion.apache.org/blog/2023/08/05/datafusion_fast_grouping/" rel="alternate" type="text/html" title= [...]
-
--->
-
-<!-- Converted from Google Docs using 
https://www.buymeacoffee.com/docstomarkdown -->
-
-<h2 
id="aggregating-millions-of-groups-fast-in-apache-arrow-datafusion">Aggregating 
Millions of Groups Fast in Apache Arrow DataFusion</h2>
-
-<p>Andrew Lamb, Daniël Heres, Raphael Taylor-Davies,</p>
-
-<p><em>Note: this article was originally published on the <a href="https://www.influxdata.com/blog/aggregating-millions-groups-fast-apache-arrow-datafusion">InfluxData Blog</a></em></p>
-
-<h2 id="tldr">TLDR</h2>
-
-<p>Grouped aggregations are a core part of any analytic tool, creating understandable summaries of huge data volumes. <a href="https://arrow.apache.org/datafusion/">Apache Arrow DataFusion</a>’s parallel aggregation capability is 2-3x faster in the <a href="https://crates.io/crates/datafusion/28.0.0">newly released version <code class="language-plaintext highlighter-rouge">28.0.0</code></a> for queries with a large number (10,000 or more) of groups.</p>
-
-<p>Improving aggregation performance matters to all users of DataFusion. For example, both InfluxDB, a <a href="https://github.com/influxdata/influxdb">time series data platform</a> and Coralogix, a <a href="https://coralogix.com/?utm_source=InfluxDB&amp;utm_medium=Blog&amp;utm_campaign=organic">full-stack observability</a> platform, aggregate vast amounts of raw data to monitor and create insights for our customers. Improving DataFusion’s performance lets us provide better user experien [...]
-
-<p>With the new optimizations, DataFusion’s grouping speed is now close to DuckDB, a system that regularly reports <a href="https://duckdblabs.github.io/db-benchmark/">great</a> <a href="https://duckdb.org/2022/03/07/aggregate-hashtable.html#experiments">grouping</a> benchmark performance numbers. Figure 1 contains a representative sample of <a href="https://github.com/ClickHouse/ClickBench/tree/main">ClickBench</a> on a single Parquet file, and the full results are at the end of this ar [...]
-
-<p><img src="/blog/assets/datafusion_fast_grouping/summary.png" width="700" 
/></p>
-
-<p><strong>Figure 1</strong>: Performance of ClickBench queries 16, 17, 18 and 19 on a single Parquet file for DataFusion <code class="language-plaintext highlighter-rouge">27.0.0</code>, DataFusion <code class="language-plaintext highlighter-rouge">28.0.0</code> and DuckDB <code class="language-plaintext highlighter-rouge">0.8.1</code>.</p>
-
-<h2 id="introduction-to-high-cardinality-grouping">Introduction to high 
cardinality grouping</h2>
-
-<p>Aggregation is a fancy word for computing summary statistics across many 
rows that have the same value in one or more columns. We call the rows with the 
same values <em>groups</em> and “high cardinality” means there are a large 
number of distinct groups in the dataset. At the time of writing, a “large” 
number of groups in analytic engines is around 10,000.</p>
-
-<p>For example the <a href="https://github.com/ClickHouse/ClickBench">ClickBench</a> <em>hits</em> dataset contains 100 million anonymized user clicks across a set of websites. ClickBench Query 17 is:</p>
-
-<div class="language-sql highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="k">SELECT</span> <span 
class="nv">"UserID"</span><span class="p">,</span> <span 
class="nv">"SearchPhrase"</span><span class="p">,</span> <span 
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span 
class="p">)</span>
-<span class="k">FROM</span> <span class="n">hits</span>
-<span class="k">GROUP</span> <span class="k">BY</span> <span 
class="nv">"UserID"</span><span class="p">,</span> <span 
class="nv">"SearchPhrase"</span>
-<span class="k">ORDER</span> <span class="k">BY</span> <span 
class="k">COUNT</span><span class="p">(</span><span class="o">*</span><span 
class="p">)</span>
-<span class="k">DESC</span> <span class="k">LIMIT</span> <span 
class="mi">10</span><span class="p">;</span>
-</code></pre></div></div>
-
-<p>In English, this query finds “the top ten (user, search phrase) 
combinations, across all clicks” and produces the following results (there are 
no search phrases for the top ten users):</p>
-
-<div class="language-text highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>+---------------------+--------------+-----------------+
-| UserID              | SearchPhrase | COUNT(UInt8(1)) |
-+---------------------+--------------+-----------------+
-| 1313338681122956954 |              | 29097           |
-| 1907779576417363396 |              | 25333           |
-| 2305303682471783379 |              | 10597           |
-| 7982623143712728547 |              | 6669            |
-| 7280399273658728997 |              | 6408            |
-| 1090981537032625727 |              | 6196            |
-| 5730251990344211405 |              | 6019            |
-| 6018350421959114808 |              | 5990            |
-| 835157184735512989  |              | 5209            |
-| 770542365400669095  |              | 4906            |
-+---------------------+--------------+-----------------+
-</code></pre></div></div>
-
-<p>The ClickBench dataset contains</p>
-
-<ul>
-  <li>99,997,497 total rows<sup id="fnref:1" role="doc-noteref"><a 
href="#fn:1" class="footnote" rel="footnote">1</a></sup></li>
-  <li>17,630,976 different users (distinct UserIDs)<sup id="fnref:2" 
role="doc-noteref"><a href="#fn:2" class="footnote" 
rel="footnote">2</a></sup></li>
-  <li>6,019,103 different search phrases<sup id="fnref:3" 
role="doc-noteref"><a href="#fn:3" class="footnote" 
rel="footnote">3</a></sup></li>
-  <li>24,070,560 distinct combinations<sup id="fnref:4" role="doc-noteref"><a 
href="#fn:4" class="footnote" rel="footnote">4</a></sup> of (UserID, 
SearchPhrase)
-Thus, to answer the query, DataFusion must map each of the 100M different 
input rows into one of the <strong>24 million different groups</strong>, and 
keep count of how many such rows there are in each group.</li>
-</ul>
-
-<h2 id="the-solution">The solution</h2>
-
-<p>Like most concepts in databases and other analytic systems, the basic ideas 
of this algorithm are straightforward and taught in introductory computer 
science courses. You could compute the query with a program such as this<sup 
id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" 
rel="footnote">5</a></sup>:</p>
-
-<div class="language-python highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code><span class="kn">import</span> <span 
class="n">pandas</span> <span class="k">as</span> <span class="n">pd</span>
-<span class="kn">from</span> <span class="n">collections</span> <span 
class="kn">import</span> <span class="n">defaultdict</span>
-<span class="kn">from</span> <span class="n">operator</span> <span 
class="kn">import</span> <span class="n">itemgetter</span>
-
-<span class="c1"># read file
-</span><span class="n">hits</span> <span class="o">=</span> <span 
class="n">pd</span><span class="p">.</span><span 
class="nf">read_parquet</span><span class="p">(</span><span 
class="sh">'</span><span class="s">hits.parquet</span><span 
class="sh">'</span><span class="p">,</span> <span class="n">engine</span><span 
class="o">=</span><span class="sh">'</span><span class="s">pyarrow</span><span 
class="sh">'</span><span class="p">)</span>
-
-<span class="c1"># build groups
-</span><span class="n">counts</span> <span class="o">=</span> <span 
class="nf">defaultdict</span><span class="p">(</span><span 
class="nb">int</span><span class="p">)</span>
-<span class="k">for</span> <span class="n">index</span><span 
class="p">,</span> <span class="n">row</span> <span class="ow">in</span> <span 
class="n">hits</span><span class="p">.</span><span 
class="nf">iterrows</span><span class="p">():</span>
-    <span class="n">group</span> <span class="o">=</span> <span 
class="p">(</span><span class="n">row</span><span class="p">[</span><span 
class="sh">'</span><span class="s">UserID</span><span class="sh">'</span><span 
class="p">],</span> <span class="n">row</span><span class="p">[</span><span 
class="sh">'</span><span class="s">SearchPhrase</span><span 
class="sh">'</span><span class="p">]);</span>
-    <span class="c1"># update the dict entry for the corresponding key
-</span>    <span class="n">counts</span><span class="p">[</span><span 
class="n">group</span><span class="p">]</span> <span class="o">+=</span> <span 
class="mi">1</span>
-
-<span class="c1"># Print the top 10 values
-</span><span class="nf">print </span><span class="p">(</span><span 
class="nf">dict</span><span class="p">(</span><span 
class="nf">sorted</span><span class="p">(</span><span 
class="n">counts</span><span class="p">.</span><span 
class="nf">items</span><span class="p">(),</span> <span 
class="n">key</span><span class="o">=</span><span 
class="nf">itemgetter</span><span class="p">(</span><span 
class="mi">1</span><span class="p">),</span> <span 
class="n">reverse</span><span class="o">=</span><sp [...]
-</code></pre></div></div>
-
-<p>This approach, while simple, is both slow and very memory inefficient. It 
requires over 40 seconds to compute the results for less than 1% of the 
dataset<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" 
rel="footnote">6</a></sup>. Both DataFusion <code class="language-plaintext 
highlighter-rouge">28.0.0</code> and DuckDB <code class="language-plaintext 
highlighter-rouge">0.8.1</code> compute results in under 10 seconds for the 
<em>entire</em> dataset.</p>
-
-<p>To answer this query quickly and efficiently, you have to write your code 
such that it:</p>
-
-<ol>
-  <li>Keeps all cores busy aggregating via parallelized computation</li>
-  <li>Updates aggregate values quickly, using vectorizable loops that are easy for compilers to translate into the high performance <a href="https://en.wikipedia.org/wiki/Single_instruction,_multiple_data">SIMD</a> instructions available in modern CPUs.</li>
-</ol>
-
-<p>The rest of this article explains how grouping works in DataFusion and the 
improvements we made in <code class="language-plaintext 
highlighter-rouge">28.0.0</code>.</p>
-
-<h3 id="two-phase-parallel-partitioned-grouping">Two phase parallel 
partitioned grouping</h3>
-
-<p>Both DataFusion <code class="language-plaintext highlighter-rouge">27.0.0</code> and <code class="language-plaintext highlighter-rouge">28.0.0</code> use state-of-the-art, two phase parallel hash partitioned grouping, similar to other high-performance vectorized engines like <a href="https://duckdb.org/2022/03/07/aggregate-hashtable.html">DuckDB’s Parallel Grouped Aggregates</a>. In pictures this looks like:</p>
-
-<div class="language-text highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>            ▲                        ▲
-            │                        │
-            │                        │
-            │                        │
-┌───────────────────────┐  ┌───────────────────┐
-│        GroupBy        │  │      GroupBy      │      Step 4
-│        (Final)        │  │      (Final)      │
-└───────────────────────┘  └───────────────────┘
-            ▲                        ▲
-            │                        │
-            └────────────┬───────────┘
-                         │
-                         │
-            ┌─────────────────────────┐
-            │       Repartition       │               Step 3
-            │         HASH(x)         │
-            └─────────────────────────┘
-                         ▲
-                         │
-            ┌────────────┴──────────┐
-            │                       │
-            │                       │
- ┌────────────────────┐  ┌─────────────────────┐
- │      GroupBy       │  │       GroupBy       │      Step 2
- │     (Partial)      │  │      (Partial)      │
- └────────────────────┘  └─────────────────────┘
-            ▲                       ▲
-         ┌──┘                       └─┐
-         │                            │
-    .─────────.                  .─────────.
- ,─'           '─.            ,─'           '─.
-;      Input      :          ;      Input      :      Step 1
-:    Stream 1     ;          :    Stream 2     ;
- ╲               ╱            ╲               ╱
-  '─.         ,─'              '─.         ,─'
-     `───────'                    `───────'
-</code></pre></div></div>
-
-<p><strong>Figure 2</strong>: Two phase repartitioned grouping: data flows 
from bottom (source) to top (results) in two phases. First (Steps 1 and 2), 
each core reads the data into a core-specific hash table, computing 
intermediate aggregates without any cross-core coordination. Then (Steps 3 and 
4) DataFusion divides the data (“repartitions”) into distinct subsets by group 
value, and each subset is sent to a specific core which computes the final 
aggregate.</p>
-
-<p>The two phases are critical for keeping cores busy in a multi-core system. 
Both phases use the same hash table approach (explained in the next section), 
but differ in how the groups are distributed and the partial results emitted 
from the accumulators. The first phase aggregates data as soon as possible 
after it is produced. However, as shown in Figure 2, the groups can be anywhere 
in any input, so the same group is often found on many different cores. The 
second phase uses a hash fun [...]
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>    ┌─────┐    ┌─────┐
-    │  1  │    │  3  │
-    │  2  │    │  4  │   2. After Repartitioning: each
-    └─────┘    └─────┘   group key  appears in exactly
-    ┌─────┐    ┌─────┐   one partition
-    │  1  │    │  3  │
-    │  2  │    │  4  │
-    └─────┘    └─────┘
-
-─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
-
-    ┌─────┐    ┌─────┐
-    │  2  │    │  2  │
-    │  1  │    │  2  │
-    │  3  │    │  3  │
-    │  4  │    │  1  │
-    └─────┘    └─────┘    1. Input Stream: groups
-      ...        ...      values are spread
-    ┌─────┐    ┌─────┐    arbitrarily over each input
-    │  1  │    │  4  │
-    │  4  │    │  3  │
-    │  1  │    │  1  │
-    │  4  │    │  3  │
-    │  3  │    │  2  │
-    │  2  │    │  2  │
-    │  2  │    └─────┘
-    └─────┘
-
-    Core A      Core B
-
-</code></pre></div></div>
-
-<p><strong>Figure 3</strong>: Group value distribution across 2 cores during 
aggregation phases. In the first phase, every group value <code 
class="language-plaintext highlighter-rouge">1</code>, <code 
class="language-plaintext highlighter-rouge">2</code>, <code 
class="language-plaintext highlighter-rouge">3</code>, <code 
class="language-plaintext highlighter-rouge">4</code>, is present in the input 
stream processed by each core. In the second phase, after repartitioning, the 
group value [...]
-
-<p>There are some additional subtleties in the <a href="https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/physical_plan/aggregates/row_hash.rs">DataFusion implementation</a> not mentioned above due to space constraints, such as:</p>
-
-<ol>
-  <li>The policy of when to emit data from the first phase’s hash table (e.g. 
because the data is partially sorted)</li>
-  <li>Handling specific filters per aggregate (due to the <code 
class="language-plaintext highlighter-rouge">FILTER</code> SQL clause)</li>
-  <li>Data types of intermediate values (which may not be the same as the 
final output for some aggregates such as <code class="language-plaintext 
highlighter-rouge">AVG</code>).</li>
-  <li>Action taken when memory use exceeds its budget.</li>
-</ol>
-
-<h3 id="hash-grouping">Hash grouping</h3>
-
-<p>DataFusion queries can compute many different aggregate functions for each group, both <a href="https://arrow.apache.org/datafusion/user-guide/sql/aggregate_functions.html">built in</a> and/or user defined <a href="https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.AggregateUDF.html"><code class="language-plaintext highlighter-rouge">AggregateUDFs</code></a>. The state for each aggregate function, called an <em>accumulator</em>, is tracked with a hash table (DataFusion u [...]
-
-<h3 id="hash-grouping-in-2700">Hash grouping in <code 
class="language-plaintext highlighter-rouge">27.0.0</code></h3>
-
-<p>As shown in Figure 3, DataFusion <code class="language-plaintext 
highlighter-rouge">27.0.0</code> stores the data in a <a 
href="https://github.com/apache/arrow-datafusion/blob/4d93b6a3802151865b68967bdc4c7d7ef425b49a/datafusion/core/src/physical_plan/aggregates/utils.rs#L38-L50"><code
 class="language-plaintext highlighter-rouge">GroupState</code></a> structure 
which, unsurprisingly, tracks the state for each group. The state for each 
group consists of:</p>
-
-<ol>
-  <li>The actual value of the group columns, in <a 
href="https://docs.rs/arrow-row/latest/arrow_row/index.html">Arrow Row</a> 
format.</li>
-  <li>In-progress accumulations (e.g. the running counts for the <code 
class="language-plaintext highlighter-rouge">COUNT</code> aggregate) for each 
group, in one of two possible formats (<a 
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/expr/src/accumulator.rs#L24-L49"><code
 class="language-plaintext highlighter-rouge">Accumulator</code></a>  or <a 
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b42
 [...]
-  <li>Scratch space for tracking which rows match each aggregate in each 
batch.</li>
-</ol>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>                           
┌──────────────────────────────────────┐
-                           │                                      │
-                           │                  ...                 │
-                           │ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ │
-                           │ ┃                                  ┃ │
-    ┌─────────┐            │ ┃ ┌──────────────────────────────┐ ┃ │
-    │         │            │ ┃ │group values: OwnedRow        │ ┃ │
-    │ ┌─────┐ │            │ ┃ └──────────────────────────────┘ ┃ │
-    │ │  5  │ │            │ ┃ ┌──────────────────────────────┐ ┃ │
-    │ ├─────┤ │            │ ┃ │Row accumulator:              │ ┃ │
-    │ │  9  │─┼────┐       │ ┃ │Vec&lt;u8&gt;                       │ ┃ │
-    │ ├─────┤ │    │       │ ┃ └──────────────────────────────┘ ┃ │
-    │ │ ... │ │    │       │ ┃ ┌──────────────────────┐         ┃ │
-    │ ├─────┤ │    │       │ ┃ │┌──────────────┐      │         ┃ │
-    │ │  1  │ │    │       │ ┃ ││Accumulator 1 │      │         ┃ │
-    │ ├─────┤ │    │       │ ┃ │└──────────────┘      │         ┃ │
-    │ │ ... │ │    │       │ ┃ │┌──────────────┐      │         ┃ │
-    │ └─────┘ │    │       │ ┃ ││Accumulator 2 │      │         ┃ │
-    │         │    │       │ ┃ │└──────────────┘      │         ┃ │
-    └─────────┘    │       │ ┃ │ Box&lt;dyn Accumulator&gt; │         ┃ │
-    Hash Table     │       │ ┃ └──────────────────────┘         ┃ │
-                   │       │ ┃ ┌─────────────────────────┐      ┃ │
-                   │       │ ┃ │scratch indices: Vec&lt;u32&gt;│      ┃ │
-                   │       │ ┃ └─────────────────────────┘      ┃ │
-                   │       │ ┃ GroupState                       ┃ │
-                   └─────▶ │ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ │
-                           │                                      │
-  Hash table tracks an     │                 ...                  │
-  index into group_states  │                                      │
-                           └──────────────────────────────────────┘
-                           group_states: Vec&lt;GroupState&gt;
-
-                           There is one GroupState PER GROUP
-
-</code></pre></div></div>
-
-<p><strong>Figure 4</strong>: Hash group operator structure in DataFusion 
<code class="language-plaintext highlighter-rouge">27.0.0</code>. A hash table 
maps each group to a GroupState which contains all the per-group states.</p>
-
-<p>To compute the aggregate, DataFusion performs the following steps for each 
input batch:</p>
-
-<ol>
-  <li>Calculate hash using <a 
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/hash_utils.rs#L264-L307">efficient
 vectorized code</a>, specialized for each data type.</li>
-  <li>Determine group indexes for each input row using the hash table 
(creating new entries for newly seen groups).</li>
-  <li><a 
href="https://github.com/apache/arrow-datafusion/blob/4ab8be57dee3bfa72dd105fbd7b8901b873a4878/datafusion/core/src/physical_plan/aggregates/row_hash.rs#L562-L602">Update
 Accumulators for each group that had input rows,</a> assembling the rows into 
a contiguous range for vectorized accumulator if there are a sufficient number 
of them.</li>
-</ol>
-
-<p>DataFusion also stores the hash values in the table to avoid potentially 
costly hash recomputation when resizing the hash table.</p>
-
-<p>This scheme works very well for a relatively small number of distinct 
groups: all accumulators are efficiently updated with large contiguous batches 
of rows.</p>
-
-<p>However, this scheme is not ideal for high cardinality grouping due to:</p>
-
-<ol>
-  <li><strong>Multiple allocations per group</strong> for the group value row 
format, as well as for the <code class="language-plaintext 
highlighter-rouge">RowAccumulator</code>s and each  <code 
class="language-plaintext highlighter-rouge">Accumulator</code>. The <code 
class="language-plaintext highlighter-rouge">Accumulator</code> may have 
additional allocations within it as well.</li>
-  <li><strong>Non-vectorized updates:</strong> Accumulator updates often fall 
back to a slower non-vectorized form because the number of distinct groups is 
large (and thus number of values per group is small) in each input batch.</li>
-</ol>
-
-<h3 id="hash-grouping-in-2800">Hash grouping in <code 
class="language-plaintext highlighter-rouge">28.0.0</code></h3>
-
-<p>For <code class="language-plaintext highlighter-rouge">28.0.0</code>, we 
rewrote the core group by implementation following traditional system 
optimization principles: fewer allocations, type specialization, and aggressive 
vectorization.</p>
-
-<p>DataFusion <code class="language-plaintext highlighter-rouge">28.0.0</code> 
uses the same RawTable and still stores group indexes. The major differences, 
as shown in Figure 4, are:</p>
-
-<ol>
-  <li>Group values are stored either
-    <ol>
-      <li>Inline in the <code class="language-plaintext 
highlighter-rouge">RawTable</code> (for single columns of primitive types), 
where the conversion to Row format costs more than its benefit</li>
-      <li>In a separate <a 
href="https://docs.rs/arrow-row/latest/arrow_row/struct.Row.html">Rows</a> 
structure with a single contiguous allocation for all groups values, rather 
than an allocation per group. Accumulators manage the state for all the groups 
internally, so the code to update intermediate values is a tight type 
specialized loop. The new <a 
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/gr
 [...]
-    </ol>
-  </li>
-</ol>
-
-<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre 
class="highlight"><code>┌───────────────────────────────────┐     
┌───────────────────────┐
-│ ┌ ─ ─ ─ ─ ─ ┐  ┌─────────────────┐│     │ ┏━━━━━━━━━━━━━━━━━━━┓ │
-│                │                 ││     │ ┃  ┌──────────────┐ ┃ │
-│ │           │  │ ┌ ─ ─ ┐┌─────┐  ││     │ ┃  │┌───────────┐ │ ┃ │
-│                │    X   │  5  │  ││     │ ┃  ││  value1   │ │ ┃ │
-│ │           │  │ ├ ─ ─ ┤├─────┤  ││     │ ┃  │└───────────┘ │ ┃ │
-│                │    Q   │  9  │──┼┼──┐  │ ┃  │     ...      │ ┃ │
-│ │           │  │ ├ ─ ─ ┤├─────┤  ││  └──┼─╋─▶│              │ ┃ │
-│                │   ...  │ ... │  ││     │ ┃  │┌───────────┐ │ ┃ │
-│ │           │  │ ├ ─ ─ ┤├─────┤  ││     │ ┃  ││  valueN   │ │ ┃ │
-│                │    H   │  1  │  ││     │ ┃  │└───────────┘ │ ┃ │
-│ │           │  │ ├ ─ ─ ┤├─────┤  ││     │ ┃  │values: Vec&lt;T&gt;│ ┃ │
-│     Rows       │   ...  │ ... │  ││     │ ┃  └──────────────┘ ┃ │
-│ │           │  │ └ ─ ─ ┘└─────┘  ││     │ ┃                   ┃ │
-│  ─ ─ ─ ─ ─ ─   │                 ││     │ ┃ GroupsAccumulator ┃ │
-│                └─────────────────┘│     │ ┗━━━━━━━━━━━━━━━━━━━┛ │
-│                  Hash Table       │     │                       │
-│                                   │     │          ...          │
-└───────────────────────────────────┘     └───────────────────────┘
-  GroupState                               Accumulators
-
-
-Hash table value stores group_indexes     One  GroupsAccumulator
-and group values.                         per aggregate. Each
-                                          stores the state for
-Group values are stored either inline     *ALL* groups, typically
-in the hash table or in a single          using a native Vec&lt;T&gt;
-allocation using the arrow Row format
-</code></pre></div></div>
-
-<p><strong>Figure 5</strong>: Hash group operator structure in DataFusion 
<code class="language-plaintext highlighter-rouge">28.0.0</code>. Group values 
are stored either directly in the hash table, or in a single allocation using 
the arrow Row format. The hash table contains group indexes. A single <code 
class="language-plaintext highlighter-rouge">GroupsAccumulator</code> stores 
the per-aggregate state for <em>all</em> groups.</p>
-
-<p>This new structure improves performance significantly for high cardinality 
groups due to:</p>
-
-<ol>
-  <li><strong>Reduced allocations</strong>: There are no longer any individual 
allocations per group.</li>
-  <li><strong>Contiguous native accumulator states</strong>: Type-specialized 
accumulators store the values for all groups in a single contiguous allocation 
using a <a href="https://doc.rust-lang.org/std/vec/struct.Vec.html">Rust 
Vec&lt;T&gt;</a> of some native type.</li>
-  <li><strong>Vectorized state update</strong>: The inner aggregate update 
loops, which are type-specialized and in terms of native <code 
class="language-plaintext highlighter-rouge">Vec</code>s, are well-vectorized 
by the Rust compiler (thanks <a href="https://llvm.org/">LLVM</a>!).</li>
-</ol>
-
-<h3 id="notes">Notes</h3>
-
-<p>Some vectorized grouping implementations store the accumulator state 
row-wise directly in the hash table, which often uses modern CPU caches 
efficiently. Managing accumulator state in columnar fashion may sacrifice some 
cache locality, however it ensures the size of the hash table remains small, 
even when there are large numbers of groups and aggregates, making it easier 
for the compiler to vectorize the accumulator update.</p>
-
-<p>Depending on the cost of recomputing hash values, DataFusion <code 
class="language-plaintext highlighter-rouge">28.0.0</code> may or may not store 
the hash values in the table. This optimizes the tradeoff between the cost of 
computing the hash value (which is expensive for strings, for example) vs. the 
cost of storing it in the hash table.</p>
-
-<p>One subtlety that arises from pushing state updates into GroupsAccumulators 
is that each accumulator must handle similar variations with/without filtering 
and with/without nulls in the input. DataFusion <code class="language-plaintext 
highlighter-rouge">28.0.0</code> uses a templated <a 
href="https://github.com/apache/arrow-datafusion/blob/a6dcd943051a083693c352c6b4279156548490a0/datafusion/physical-expr/src/aggregate/groups_accumulator/accumulate.rs#L28-L54"><code
 class="language-pla [...]
-
-<p>The code structure is heavily influenced by the fact DataFusion is 
implemented using <a href="https://www.rust-lang.org/">Rust</a>, a new(ish) 
systems programming language focused on speed and safety. Rust heavily 
discourages many of the traditional pointer casting “tricks” used in C/C++ hash 
grouping implementations. The DataFusion aggregation code is almost entirely <a 
href="https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html#:~:text=Safe%20Rust%20is%20the%20true,Undefined%2
 [...]
-
-<h2 id="clickbench-results">ClickBench results</h2>
-
-<p>The full results of running the <a 
href="https://github.com/ClickHouse/ClickBench/tree/main">ClickBench</a> 
queries against the single Parquet file with DataFusion <code 
class="language-plaintext highlighter-rouge">27.0.0</code>, DataFusion <code 
class="language-plaintext highlighter-rouge">28.0.0</code>, and DuckDB <code 
class="language-plaintext highlighter-rouge">0.8.1</code> are below. These 
numbers were run on a GCP <code class="language-plaintext 
highlighter-rouge">e2-standard-8 [...]
-
-<p>As the industry moves towards data systems assembled from components, it is 
increasingly important that they exchange data using open standards such as <a 
href="https://arrow.apache.org/">Apache Arrow</a> and <a 
href="https://parquet.apache.org/">Parquet</a> rather than custom storage and 
in-memory formats. Thus, this benchmark uses a single input Parquet file 
representative of many DataFusion users and aligned with the current trend in 
analytics of avoiding a costly load/transformati [...]
-
-<p>DataFusion now reaches near-DuckDB-speeds querying Parquet data. While we 
don’t plan to engage in a benchmarking shootout with a team that literally 
wrote <a href="https://dl.acm.org/doi/abs/10.1145/3209950.3209955">Fair 
Benchmarking Considered Difficult</a>, hopefully everyone can agree that 
DataFusion <code class="language-plaintext highlighter-rouge">28.0.0</code> is 
a significant improvement.</p>
-
-<p><img src="/blog/assets/datafusion_fast_grouping/full.png" width="700" /></p>
-
-<p><strong>Figure 6</strong>: Performance of DataFusion <code 
class="language-plaintext highlighter-rouge">27.0.0</code>, DataFusion <code 
class="language-plaintext highlighter-rouge">28.0.0</code>, and DuckDB <code 
class="language-plaintext highlighter-rouge">0.8.1</code> on all 43 ClickBench 
queries against a single <code class="language-plaintext 
highlighter-rouge">hits.parquet</code> file. Lower is better.</p>
-
-<h3 id="notes-1">Notes</h3>
-
-<p>DataFusion <code class="language-plaintext highlighter-rouge">27.0.0</code> 
was not able to run several queries due to either planner bugs (Q9, Q11, Q12, 
14) or running out of memory (Q33). DataFusion <code class="language-plaintext 
highlighter-rouge">28.0.0</code> solves those issues.</p>
-
-<p>DataFusion is faster than DuckDB for query 21 and 22, likely due to 
optimized implementations of string pattern matching.</p>
-
-<h2 id="conclusion-performance-matters">Conclusion: performance matters</h2>
-
-<p>Improving aggregation performance by more than a factor of two allows 
developers building products and projects with DataFusion to spend more time on 
value-added domain specific features. We believe building systems with 
DataFusion is much faster than trying to build something similar from scratch. 
DataFusion increases productivity because it eliminates the need to rebuild 
well-understood, but costly to implement, analytic database technology. While 
we’re pleased with the improvements [...]
-
-<h2 id="acknowledgments">Acknowledgments</h2>
-
-<p>DataFusion is a <a 
href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">community
 effort</a> and this work was not possible without contributions from many in 
the community. A special shout out to <a 
href="https://github.com/sunchao">sunchao</a>, <a 
href="https://github.com/jyshen">yjshen</a>, <a 
href="https://github.com/yahoNanJing">yahoNanJing</a>, <a 
href="https://github.com/mingmwang">mingmwang</a>, <a 
href="https://github.com/ozankabak">ozankabak</a>, < [...]
-
-<h2 id="about-datafusion">About DataFusion</h2>
-
-<p><a href="https://arrow.apache.org/datafusion/">Apache Arrow DataFusion</a> 
is an extensible query engine and database toolkit, written in <a 
href="https://www.rust-lang.org/">Rust</a>, that uses <a 
href="https://arrow.apache.org/">Apache Arrow</a> as its in-memory format. 
DataFusion, along with <a href="https://calcite.apache.org/">Apache 
Calcite</a>, Facebook’s <a 
href="https://github.com/facebookincubator/velox">Velox</a>, and similar 
technology are part of the next generation “<a h [...]
-
-<!-- Footnotes themselves at the bottom. -->
-<h2 id="notes-2">Notes</h2>
-
-<div class="footnotes" role="doc-endnotes">
-  <ol>
-    <li id="fn:1" role="doc-endnote">
-      <p><code class="language-plaintext highlighter-rouge">SELECT COUNT(*) 
FROM 'hits.parquet';</code> <a href="#fnref:1" class="reversefootnote" 
role="doc-backlink">&#8617;</a></p>
-    </li>
-    <li id="fn:2" role="doc-endnote">
-      <p><code class="language-plaintext highlighter-rouge">SELECT 
COUNT(DISTINCT "UserID") as num_users FROM 'hits.parquet';</code> <a 
href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
-    </li>
-    <li id="fn:3" role="doc-endnote">
-      <p><code class="language-plaintext highlighter-rouge">SELECT 
COUNT(DISTINCT "SearchPhrase") as num_phrases FROM 'hits.parquet';</code> <a 
href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
-    </li>
-    <li id="fn:4" role="doc-endnote">
-      <p><code class="language-plaintext highlighter-rouge">SELECT COUNT(*) 
FROM (SELECT DISTINCT "UserID", "SearchPhrase" FROM 'hits.parquet')</code> <a 
href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
-    </li>
-    <li id="fn:5" role="doc-endnote">
-      <p>Full script at <a 
href="https://github.com/alamb/datafusion-duckdb-benchmark/blob/main/hash.py">hash.py</a>
 <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
-    </li>
-    <li id="fn:6" role="doc-endnote">
-      <p><a 
href="https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_%7B%7D.parquet">hits_0.parquet</a>,
 one of the files from the partitioned ClickBench dataset, which has <code 
class="language-plaintext highlighter-rouge">100,000</code> rows and is 117 MB 
in size. The entire dataset has <code class="language-plaintext 
highlighter-rouge">100,000,000</code> rows in a single 14 GB Parquet file. The 
script did not complete on the entire dataset after 40 minutes, and us [...]
-    </li>
-  </ol>
-</div>]]></content><author><name>alamb, Dandandan, 
tustvold</name></author><category term="release" /><summary 
type="html"><![CDATA[&lt;!–]]></summary></entry></feed>
\ No newline at end of file
+the methods listed in our <a 
href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">Communication
 Doc</a>.</p>]]></content><author><name>pmc</name></author><category 
term="release" /><summary 
type="html"><![CDATA[&lt;!–]]></summary></entry></feed>
\ No newline at end of file
diff --git a/index.html b/index.html
index 7b3aeee..0d1617e 100644
--- a/index.html
+++ b/index.html
@@ -38,7 +38,12 @@
       <div class="wrapper">
         <div class="home">
 <h2 class="post-list-heading">Posts</h2>
-    <ul class="post-list"><li><span class="post-meta">Sep 13, 2024</span>
+    <ul class="post-list"><li><span class="post-meta">Sep 27, 2024</span>
+        <h3>
+          <a class="post-link" href="/blog/2024/09/27/datafusion-comet-0.3.0/">
+            Apache DataFusion Comet 0.3.0 Release
+          </a>
+        </h3></li><li><span class="post-meta">Sep 13, 2024</span>
         <h3>
           <a class="post-link" 
href="/blog/2024/09/13/string-view-german-style-strings-part-2/">
             Using StringView / German Style Strings to make Queries Faster: 
Part 2 - String Operations

