(datafusion-site) branch asf-staging updated: Commit build products

github-bot Mon, 09 Jun 2025 22:31:15 -0700

This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-staging
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git



The following commit(s) were added to refs/heads/asf-staging by this push:
     new 837cdf7  Commit build products
837cdf7 is described below

commit 837cdf7321be517b803d71adcf8017d3c9ce8298
Author: Build Pelican (action) <priv...@infra.apache.org>
AuthorDate: Tue Jun 10 03:14:34 2025 +0000

    Commit build products
---
 .../optimizing-sql-dataframes-part-one/index.html  | 20 ++++++++-------
 .../optimizing-sql-dataframes-part-two/index.html  | 10 ++++----
 blog/feeds/alamb-akurmustafa.atom.xml              | 30 ++++++++++++----------
 blog/feeds/all-en.atom.xml                         | 30 ++++++++++++----------
 blog/feeds/blog.atom.xml                           | 30 ++++++++++++----------
 5 files changed, 64 insertions(+), 56 deletions(-)

diff --git a/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html 
b/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html
index 7bc4b1a..93372ba 100644
--- a/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html
+++ b/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html
@@ -71,7 +71,7 @@ Pavlo, or some behind-the-scenes player. We believe this 
perception is because:<
 <li>
 <p>One must implement the rest of a database system (data storage, 
transactions,
    SQL parser, expression evaluation, plan execution, etc.) 
<strong>before</strong> the
-   optimizer becomes critical[^5].</p>
+   optimizer becomes critical<sup id="fn5"><a 
href="#footnote5">5</a></sup>.</p>
 </li>
 <li>
 <p>Some parts of the optimizer are tightly tied to the rest of the system 
(e.g.,
@@ -168,11 +168,11 @@ choosing specific aggregation algorithms.</p>
 <h2>Query Optimizer Implementation</h2>
 <p>Industrial optimizers, such as 
 DataFusion&rsquo;s (<a 
href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src">source</a>),
-ClickHouse (<a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>,<a
 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>),
+ClickHouse (<a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>,
 <a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>),
 DuckDB (<a 
href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer">source</a>),
 and Apache Spark (<a 
href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer">source</a>),
 are implemented as a series of passes or rules that rewrite a query plan. The
-overall optimizer is composed of a sequence of these rules,[^6] as shown in
+overall optimizer is composed of a sequence of these rules,<sup id="fn6"><a 
href="#footnote6">6</a></sup> as shown in
 Figure 4. The specific order of the rules also often matters, but we will not
 discuss this detail in this post.</p>
 <p>A multi-pass design is standard because it helps:</p>
@@ -217,8 +217,8 @@ DataFusion</a> PMC member. A Database Optimizer
 connoisseur, he worked on the <a 
href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">Vertica 
Analytic
 Database</a> Query
 Optimizer for six years, has several granted US patents related to query
-optimization[^1], co-authored several papers[^2]  about the topic (including in
-VLDB 2024<a 
href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)">^3</a>),
 and spent several weeks<a 
href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]"
 
title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111)
 [https://www. [...]
+optimization<sup id="fn1"><a href="#footnote1">1</a></sup>, co-authored 
several papers<sup id="fn2"><a href="#footnote2">2</a></sup>  about the topic 
(including in
+VLDB 2024<sup id="fn3"><a href="#footnote3">3</a></sup>), and spent several 
weeks<sup id="fn4"><a href="#footnote4">4</a></sup> deeply geeking out about 
this topic
 with other experts (thank you Dagstuhl).</p>
 <p><a href="https://www.linkedin.com/in/akurmustafa/">Mustafa Akur</a> is a 
PhD Student at
 <a href="https://www.ohsu.edu/">OHSU</a> Knight Cancer Institute and an <a 
href="https://datafusion.apache.org/">Apache
@@ -227,10 +227,12 @@ Software Developer at <a 
href="https://www.synnada.ai/">Synnada</a> where he con
 significant features to the DataFusion optimizer, including many <a 
href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/">sort-based
 optimizations</a>.</p>
 <h2>Notes</h2>
-<p>[^1]: <em>Modular Query Optimizer, US 8,312,027 &middot; Issued Nov 13, 
2012</em>, Query Optimizer with schema conversion US 8,086,598 &middot; Issued 
Dec 27, 2011</p>
-<p>[^2]: <a 
href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers">The
 Vertica Query Optimizer: The case for specialized Query Optimizers</a></p>
-<p>[^5]: And thus in academic classes, by the time you get around to an 
optimizer the semester is over and everyone is ready for the semester to be 
done. Once industrial systems mature to the point where the optimizer is a 
bottleneck, the shiny new-ness of the<a 
href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"> hype cycle</a> has 
worn off and it is likely in the trough of disappointment.  </p>
-<p>[^6]: Often systems will classify these passes into different categories, 
but I am simplifying here</p>
+<p id="footnote1"><sup>[1]</sup> *Modular Query Optimizer, US 8,312,027 
&middot; Issued Nov 13, 2012*, Query Optimizer with schema conversion US 
8,086,598 &middot; Issued Dec 27, 2011</p>
+<p id="footnote2"><sup>[2]</sup> [The Vertica Query Optimizer: The case for 
specialized Query 
Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)</p>
+<p id="footnote3"><sup>[3]</sup> 
[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)</p>
+<p id="footnote4"><sup>[4]</sup> 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111)
 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321)</p>
+<p id="footnote5"><sup>[5]</sup>  And thus in academic classes, by the time 
you get around to an optimizer the semester is over and everyone is ready for 
the semester to be done. Once industrial systems mature to the point where the 
optimizer is a bottleneck, the shiny new-ness of the[ hype 
cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and it is 
likely in the trough of disappointment.</p>
+<p id="footnote6"><sup>[6]</sup> Often systems will classify these passes into 
different categories, but I am simplifying here</p>
         </div>
       </div>
     </div>    
diff --git a/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html 
b/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html
index 2235435..15a1668 100644
--- a/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html
+++ b/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html
@@ -197,7 +197,7 @@ support row-at-a-time evaluation given how terrible the 
performance would be.
 Instead, analytic systems rewrite such queries into joins which can perform 
100s
 or 1000s of times faster for large datasets. However, transforming subqueries 
to
 joins requires &ldquo;exotic&rdquo; join semantics such as <code>SEMI 
JOIN</code>, <code>ANTI JOIN</code>  and
-variations on how to treat equality with null[^7].</p>
+variations on how to treat equality with null<sup id="fn7"><a 
href="#footnote7">7</a>.</sup></p>
 <p>For a simple example, consider that a query like this:</p>
 <pre><code class="language-sql">SELECT customer.name 
 FROM customer 
@@ -329,7 +329,7 @@ potentially (very) different performance. The major options 
in this category are
    estimates their cost and picks the one using the lowest cost.</p>
 </li>
 </ol>
-<p>For some examples, you can read about [Spark&rsquo;s cost-based optimizer] 
or look at
+<p>For some examples, you can read about <a 
href="https://docs.databricks.com/aws/en/optimizations/cbo">Spark&rsquo;s 
cost-based optimizer</a> or look at
 the code for <a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">DataFusion&rsquo;s
 join selection</a> and <a 
href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp">DuckDB&rsquo;s
 cost model</a> and <a 
href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472">join
 order enumeration</a>.</p>
 <p>However, the use of heuristics and (imprecise) cost models means optimizers 
must</p>
@@ -361,7 +361,7 @@ Instead, keeping with its <a 
href="https://docs.rs/datafusion/latest/datafusion/
 implementation along with extension points to customize behavior.</p>
 <p>Specifically, DataFusion includes</p>
 <ol>
-<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in 
the query[^8]) with basic join re-ordering (<a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>)
 to prevent join disasters.</li>
+<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in 
the query<sup id="fn8"><a href="#footnote8">8</a>) with basic join re-ordering 
(<a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>)
 to prevent join disasters.</sup></li>
 <li>Support for <a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html">ColumnStatistics</a>
 and <a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html">Table
 Statistics</a></li>
 <li>The framework for <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity">filter
 selectivity</a> + join cardinality estimation.</li>
 <li>APIs for easily rewriting plans, such as the <a 
href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview">TreeNode
 API</a> and <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs">reordering
 joins</a></li>
@@ -401,8 +401,8 @@ learning more about how they are designed and implemented, 
please <a href="https
 community</a>. We welcome first time contributors as well as long time 
participants
 to the fun of building a database together.</p>
 <h2>Notes</h2>
-<p>[^7]: See <a 
href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf">Unnesting
 Arbitrary Queries</a> from Neumann and Kemper for a more academic 
treatment.</p>
-<p>[^8]: One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online 
lectures</p>
+<p id="footnote7"><sup>[7]</sup> See [Unnesting Arbitrary 
Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf)
 from Neumann and Kemper for a more academic treatment.</p>
+<p id="footnote8"><sup>[8]</sup> One of my favorite terms I learned from Andy 
Pavlo&rsquo;s CMU online lectures</p>
         </div>
       </div>
     </div>    
diff --git a/blog/feeds/alamb-akurmustafa.atom.xml 
b/blog/feeds/alamb-akurmustafa.atom.xml
index d4a265f..2d45f81 100644
--- a/blog/feeds/alamb-akurmustafa.atom.xml
+++ b/blog/feeds/alamb-akurmustafa.atom.xml
@@ -53,7 +53,7 @@ Pavlo, or some behind-the-scenes player. We believe this 
perception is because:&
 &lt;li&gt;
 &lt;p&gt;One must implement the rest of a database system (data storage, 
transactions,
    SQL parser, expression evaluation, plan execution, etc.) 
&lt;strong&gt;before&lt;/strong&gt; the
-   optimizer becomes critical[^5].&lt;/p&gt;
+   optimizer becomes critical&lt;sup id="fn5"&gt;&lt;a 
href="#footnote5"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
 &lt;/li&gt;
 &lt;li&gt;
 &lt;p&gt;Some parts of the optimizer are tightly tied to the rest of the 
system (e.g.,
@@ -150,11 +150,11 @@ choosing specific aggregation algorithms.&lt;/p&gt;
 &lt;h2&gt;Query Optimizer Implementation&lt;/h2&gt;
 &lt;p&gt;Industrial optimizers, such as 
 DataFusion&amp;rsquo;s (&lt;a 
href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src"&gt;source&lt;/a&gt;),
-ClickHouse (&lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes"&gt;source&lt;/a&gt;,&lt;a
 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations"&gt;source&lt;/a&gt;),
+ClickHouse (&lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes"&gt;source&lt;/a&gt;,
 &lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations"&gt;source&lt;/a&gt;),
 DuckDB (&lt;a 
href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer"&gt;source&lt;/a&gt;),
 and Apache Spark (&lt;a 
href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer"&gt;source&lt;/a&gt;),
 are implemented as a series of passes or rules that rewrite a query plan. The
-overall optimizer is composed of a sequence of these rules,[^6] as shown in
+overall optimizer is composed of a sequence of these rules,&lt;sup 
id="fn6"&gt;&lt;a href="#footnote6"&gt;6&lt;/a&gt;&lt;/sup&gt; as shown in
 Figure 4. The specific order of the rules also often matters, but we will not
 discuss this detail in this post.&lt;/p&gt;
 &lt;p&gt;A multi-pass design is standard because it helps:&lt;/p&gt;
@@ -199,8 +199,8 @@ DataFusion&lt;/a&gt; PMC member. A Database Optimizer
 connoisseur, he worked on the &lt;a 
href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf"&gt;Vertica 
Analytic
 Database&lt;/a&gt; Query
 Optimizer for six years, has several granted US patents related to query
-optimization[^1], co-authored several papers[^2]  about the topic (including in
-VLDB 2024&lt;a 
href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)"&gt;^3&lt;/a&gt;),
 and spent several weeks&lt;a 
href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]"
 
title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111
 [...]
+optimization&lt;sup id="fn1"&gt;&lt;a 
href="#footnote1"&gt;1&lt;/a&gt;&lt;/sup&gt;, co-authored several papers&lt;sup 
id="fn2"&gt;&lt;a href="#footnote2"&gt;2&lt;/a&gt;&lt;/sup&gt;  about the topic 
(including in
+VLDB 2024&lt;sup id="fn3"&gt;&lt;a 
href="#footnote3"&gt;3&lt;/a&gt;&lt;/sup&gt;), and spent several weeks&lt;sup 
id="fn4"&gt;&lt;a href="#footnote4"&gt;4&lt;/a&gt;&lt;/sup&gt; deeply geeking 
out about this topic
 with other experts (thank you Dagstuhl).&lt;/p&gt;
 &lt;p&gt;&lt;a href="https://www.linkedin.com/in/akurmustafa/"&gt;Mustafa 
Akur&lt;/a&gt; is a PhD Student at
 &lt;a href="https://www.ohsu.edu/"&gt;OHSU&lt;/a&gt; Knight Cancer Institute 
and an &lt;a href="https://datafusion.apache.org/"&gt;Apache
@@ -209,10 +209,12 @@ Software Developer at &lt;a 
href="https://www.synnada.ai/"&gt;Synnada&lt;/a&gt;
 significant features to the DataFusion optimizer, including many &lt;a 
href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/"&gt;sort-based
 optimizations&lt;/a&gt;.&lt;/p&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
-&lt;p&gt;[^1]: &lt;em&gt;Modular Query Optimizer, US 8,312,027 &amp;middot; 
Issued Nov 13, 2012&lt;/em&gt;, Query Optimizer with schema conversion US 
8,086,598 &amp;middot; Issued Dec 27, 2011&lt;/p&gt;
-&lt;p&gt;[^2]: &lt;a 
href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers"&gt;The
 Vertica Query Optimizer: The case for specialized Query 
Optimizers&lt;/a&gt;&lt;/p&gt;
-&lt;p&gt;[^5]: And thus in academic classes, by the time you get around to an 
optimizer the semester is over and everyone is ready for the semester to be 
done. Once industrial systems mature to the point where the optimizer is a 
bottleneck, the shiny new-ness of the&lt;a 
href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"&gt; hype 
cycle&lt;/a&gt; has worn off and it is likely in the trough of disappointment.  
&lt;/p&gt;
-&lt;p&gt;[^6]: Often systems will classify these passes into different 
categories, but I am simplifying here&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in 
DataFusion, Part 2: Optimizers in Apache DataFusion</title><link 
href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two"
 
rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-06-15T00:00:00+00:00</updated><autho
 [...]
+&lt;p id="footnote1"&gt;&lt;sup&gt;[1]&lt;/sup&gt; *Modular Query Optimizer, 
US 8,312,027 &amp;middot; Issued Nov 13, 2012*, Query Optimizer with schema 
conversion US 8,086,598 &amp;middot; Issued Dec 27, 2011&lt;/p&gt;
+&lt;p id="footnote2"&gt;&lt;sup&gt;[2]&lt;/sup&gt; [The Vertica Query 
Optimizer: The case for specialized Query 
Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)&lt;/p&gt;
+&lt;p id="footnote3"&gt;&lt;sup&gt;[3]&lt;/sup&gt; 
[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)&lt;/p&gt;
+&lt;p id="footnote4"&gt;&lt;sup&gt;[4]&lt;/sup&gt; 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111)
 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/sem
 [...]
+&lt;p id="footnote5"&gt;&lt;sup&gt;[5]&lt;/sup&gt;  And thus in academic 
classes, by the time you get around to an optimizer the semester is over and 
everyone is ready for the semester to be done. Once industrial systems mature 
to the point where the optimizer is a bottleneck, the shiny new-ness of the[ 
hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and 
it is likely in the trough of disappointment.&lt;/p&gt;
+&lt;p id="footnote6"&gt;&lt;sup&gt;[6]&lt;/sup&gt; Often systems will classify 
these passes into different categories, but I am simplifying 
here&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in 
DataFusion, Part 2: Optimizers in Apache DataFusion</title><link 
href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two"
 
rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-
 [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -388,7 +390,7 @@ support row-at-a-time evaluation given how terrible the 
performance would be.
 Instead, analytic systems rewrite such queries into joins which can perform 
100s
 or 1000s of times faster for large datasets. However, transforming subqueries 
to
 joins requires &amp;ldquo;exotic&amp;rdquo; join semantics such as 
&lt;code&gt;SEMI JOIN&lt;/code&gt;, &lt;code&gt;ANTI JOIN&lt;/code&gt;  and
-variations on how to treat equality with null[^7].&lt;/p&gt;
+variations on how to treat equality with null&lt;sup id="fn7"&gt;&lt;a 
href="#footnote7"&gt;7&lt;/a&gt;.&lt;/sup&gt;&lt;/p&gt;
 &lt;p&gt;For a simple example, consider that a query like this:&lt;/p&gt;
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT customer.name 
 FROM customer 
@@ -520,7 +522,7 @@ potentially (very) different performance. The major options 
in this category are
    estimates their cost and picks the one using the lowest cost.&lt;/p&gt;
 &lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;For some examples, you can read about [Spark&amp;rsquo;s cost-based 
optimizer] or look at
+&lt;p&gt;For some examples, you can read about &lt;a 
href="https://docs.databricks.com/aws/en/optimizations/cbo"&gt;Spark&amp;rsquo;s
 cost-based optimizer&lt;/a&gt; or look at
 the code for &lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;DataFusion&amp;rsquo;s
 join selection&lt;/a&gt; and &lt;a 
href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp"&gt;DuckDB&amp;rsquo;s
 cost model&lt;/a&gt; and &lt;a 
href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472"&gt;join
 order enumeration&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;However, the use of heuristics and (imprecise) cost models means 
optimizers must&lt;/p&gt;
@@ -552,7 +554,7 @@ Instead, keeping with its &lt;a 
href="https://docs.rs/datafusion/latest/datafusi
 implementation along with extension points to customize behavior.&lt;/p&gt;
 &lt;p&gt;Specifically, DataFusion includes&lt;/p&gt;
 &lt;ol&gt;
-&lt;li&gt;&amp;ldquo;Syntactic Optimizer&amp;rdquo; (joins in the order they 
are listed in the query[^8]) with basic join re-ordering (&lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;source&lt;/a&gt;)
 to prevent join disasters.&lt;/li&gt;
+&lt;li&gt;&amp;ldquo;Syntactic Optimizer&amp;rdquo; (joins in the order they 
are listed in the query&lt;sup id="fn8"&gt;&lt;a 
href="#footnote8"&gt;8&lt;/a&gt;) with basic join re-ordering (&lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;source&lt;/a&gt;)
 to prevent join disasters.&lt;/sup&gt;&lt;/li&gt;
 &lt;li&gt;Support for &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html"&gt;ColumnStatistics&lt;/a&gt;
 and &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html"&gt;Table
 Statistics&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;The framework for &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity"&gt;filter
 selectivity&lt;/a&gt; + join cardinality estimation.&lt;/li&gt;
 &lt;li&gt;APIs for easily rewriting plans, such as the &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview"&gt;TreeNode
 API&lt;/a&gt; and &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs"&gt;reordering
 joins&lt;/a&gt;&lt;/li&gt;
@@ -592,5 +594,5 @@ learning more about how they are designed and implemented, 
please &lt;a href="ht
 community&lt;/a&gt;. We welcome first time contributors as well as long time 
participants
 to the fun of building a database together.&lt;/p&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
-&lt;p&gt;[^7]: See &lt;a 
href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf"&gt;Unnesting
 Arbitrary Queries&lt;/a&gt; from Neumann and Kemper for a more academic 
treatment.&lt;/p&gt;
-&lt;p&gt;[^8]: One of my favorite terms I learned from Andy Pavlo&amp;rsquo;s 
CMU online lectures&lt;/p&gt;</content><category 
term="blog"></category></entry></feed>
\ No newline at end of file
+&lt;p id="footnote7"&gt;&lt;sup&gt;[7]&lt;/sup&gt; See [Unnesting Arbitrary 
Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf)
 from Neumann and Kemper for a more academic treatment.&lt;/p&gt;
+&lt;p id="footnote8"&gt;&lt;sup&gt;[8]&lt;/sup&gt; One of my favorite terms I 
learned from Andy Pavlo&amp;rsquo;s CMU online 
lectures&lt;/p&gt;</content><category term="blog"></category></entry></feed>
\ No newline at end of file
diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml
index ddf2237..cca84e1 100644
--- a/blog/feeds/all-en.atom.xml
+++ b/blog/feeds/all-en.atom.xml
@@ -53,7 +53,7 @@ Pavlo, or some behind-the-scenes player. We believe this 
perception is because:&
 &lt;li&gt;
 &lt;p&gt;One must implement the rest of a database system (data storage, 
transactions,
    SQL parser, expression evaluation, plan execution, etc.) 
&lt;strong&gt;before&lt;/strong&gt; the
-   optimizer becomes critical[^5].&lt;/p&gt;
+   optimizer becomes critical&lt;sup id="fn5"&gt;&lt;a 
href="#footnote5"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
 &lt;/li&gt;
 &lt;li&gt;
 &lt;p&gt;Some parts of the optimizer are tightly tied to the rest of the 
system (e.g.,
@@ -150,11 +150,11 @@ choosing specific aggregation algorithms.&lt;/p&gt;
 &lt;h2&gt;Query Optimizer Implementation&lt;/h2&gt;
 &lt;p&gt;Industrial optimizers, such as 
 DataFusion&amp;rsquo;s (&lt;a 
href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src"&gt;source&lt;/a&gt;),
-ClickHouse (&lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes"&gt;source&lt;/a&gt;,&lt;a
 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations"&gt;source&lt;/a&gt;),
+ClickHouse (&lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes"&gt;source&lt;/a&gt;,
 &lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations"&gt;source&lt;/a&gt;),
 DuckDB (&lt;a 
href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer"&gt;source&lt;/a&gt;),
 and Apache Spark (&lt;a 
href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer"&gt;source&lt;/a&gt;),
 are implemented as a series of passes or rules that rewrite a query plan. The
-overall optimizer is composed of a sequence of these rules,[^6] as shown in
+overall optimizer is composed of a sequence of these rules,&lt;sup 
id="fn6"&gt;&lt;a href="#footnote6"&gt;6&lt;/a&gt;&lt;/sup&gt; as shown in
 Figure 4. The specific order of the rules also often matters, but we will not
 discuss this detail in this post.&lt;/p&gt;
 &lt;p&gt;A multi-pass design is standard because it helps:&lt;/p&gt;
@@ -199,8 +199,8 @@ DataFusion&lt;/a&gt; PMC member. A Database Optimizer
 connoisseur, he worked on the &lt;a 
href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf"&gt;Vertica 
Analytic
 Database&lt;/a&gt; Query
 Optimizer for six years, has several granted US patents related to query
-optimization[^1], co-authored several papers[^2]  about the topic (including in
-VLDB 2024&lt;a 
href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)"&gt;^3&lt;/a&gt;),
 and spent several weeks&lt;a 
href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]"
 
title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111
 [...]
+optimization&lt;sup id="fn1"&gt;&lt;a 
href="#footnote1"&gt;1&lt;/a&gt;&lt;/sup&gt;, co-authored several papers&lt;sup 
id="fn2"&gt;&lt;a href="#footnote2"&gt;2&lt;/a&gt;&lt;/sup&gt;  about the topic 
(including in
+VLDB 2024&lt;sup id="fn3"&gt;&lt;a 
href="#footnote3"&gt;3&lt;/a&gt;&lt;/sup&gt;), and spent several weeks&lt;sup 
id="fn4"&gt;&lt;a href="#footnote4"&gt;4&lt;/a&gt;&lt;/sup&gt; deeply geeking 
out about this topic
 with other experts (thank you Dagstuhl).&lt;/p&gt;
 &lt;p&gt;&lt;a href="https://www.linkedin.com/in/akurmustafa/"&gt;Mustafa 
Akur&lt;/a&gt; is a PhD Student at
 &lt;a href="https://www.ohsu.edu/"&gt;OHSU&lt;/a&gt; Knight Cancer Institute 
and an &lt;a href="https://datafusion.apache.org/"&gt;Apache
@@ -209,10 +209,12 @@ Software Developer at &lt;a 
href="https://www.synnada.ai/"&gt;Synnada&lt;/a&gt;
 significant features to the DataFusion optimizer, including many &lt;a 
href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/"&gt;sort-based
 optimizations&lt;/a&gt;.&lt;/p&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
-&lt;p&gt;[^1]: &lt;em&gt;Modular Query Optimizer, US 8,312,027 &amp;middot; 
Issued Nov 13, 2012&lt;/em&gt;, Query Optimizer with schema conversion US 
8,086,598 &amp;middot; Issued Dec 27, 2011&lt;/p&gt;
-&lt;p&gt;[^2]: &lt;a 
href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers"&gt;The
 Vertica Query Optimizer: The case for specialized Query 
Optimizers&lt;/a&gt;&lt;/p&gt;
-&lt;p&gt;[^5]: And thus in academic classes, by the time you get around to an 
optimizer the semester is over and everyone is ready for the semester to be 
done. Once industrial systems mature to the point where the optimizer is a 
bottleneck, the shiny new-ness of the&lt;a 
href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"&gt; hype 
cycle&lt;/a&gt; has worn off and it is likely in the trough of disappointment.  
&lt;/p&gt;
-&lt;p&gt;[^6]: Often systems will classify these passes into different 
categories, but I am simplifying here&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in 
DataFusion, Part 2: Optimizers in Apache DataFusion</title><link 
href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two"
 
rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-06-15T00:00:00+00:00</updated><autho
 [...]
+&lt;p id="footnote1"&gt;&lt;sup&gt;[1]&lt;/sup&gt; *Modular Query Optimizer, 
US 8,312,027 &amp;middot; Issued Nov 13, 2012*, Query Optimizer with schema 
conversion US 8,086,598 &amp;middot; Issued Dec 27, 2011&lt;/p&gt;
+&lt;p id="footnote2"&gt;&lt;sup&gt;[2]&lt;/sup&gt; [The Vertica Query 
Optimizer: The case for specialized Query 
Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)&lt;/p&gt;
+&lt;p id="footnote3"&gt;&lt;sup&gt;[3]&lt;/sup&gt; 
[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)&lt;/p&gt;
+&lt;p id="footnote4"&gt;&lt;sup&gt;[4]&lt;/sup&gt; 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111)
 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/sem
 [...]
+&lt;p id="footnote5"&gt;&lt;sup&gt;[5]&lt;/sup&gt;  And thus in academic 
classes, by the time you get around to an optimizer the semester is over and 
everyone is ready for the semester to be done. Once industrial systems mature 
to the point where the optimizer is a bottleneck, the shiny new-ness of the[ 
hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and 
it is likely in the trough of disappointment.&lt;/p&gt;
+&lt;p id="footnote6"&gt;&lt;sup&gt;[6]&lt;/sup&gt; Often systems will classify 
these passes into different categories, but I am simplifying 
here&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in 
DataFusion, Part 2: Optimizers in Apache DataFusion</title><link 
href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two";
 
rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-
 [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -388,7 +390,7 @@ support row-at-a-time evaluation given how terrible the 
performance would be.
 Instead, analytic systems rewrite such queries into joins which can perform 
100s
 or 1000s of times faster for large datasets. However, transforming subqueries 
to
 joins requires &amp;ldquo;exotic&amp;rdquo; join semantics such as 
&lt;code&gt;SEMI JOIN&lt;/code&gt;, &lt;code&gt;ANTI JOIN&lt;/code&gt;  and
-variations on how to treat equality with null[^7].&lt;/p&gt;
+variations on how to treat equality with null&lt;sup id="fn7"&gt;&lt;a 
href="#footnote7"&gt;7&lt;/a&gt;.&lt;/sup&gt;&lt;/p&gt;
 &lt;p&gt;For a simple example, consider that a query like this:&lt;/p&gt;
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT customer.name 
 FROM customer 
@@ -520,7 +522,7 @@ potentially (very) different performance. The major options 
in this category are
    estimates their cost and picks the one using the lowest cost.&lt;/p&gt;
 &lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;For some examples, you can read about [Spark&amp;rsquo;s cost-based 
optimizer] or look at
+&lt;p&gt;For some examples, you can read about &lt;a 
href="https://docs.databricks.com/aws/en/optimizations/cbo"&gt;Spark&amp;rsquo;s
 cost-based optimizer&lt;/a&gt; or look at
 the code for &lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;DataFusion&amp;rsquo;s
 join selection&lt;/a&gt; and &lt;a 
href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp"&gt;DuckDB&amp;rsquo;s
 cost model&lt;/a&gt; and &lt;a 
href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472"&gt;join
 order enumeration&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;However, the use of heuristics and (imprecise) cost models means 
optimizers must&lt;/p&gt;
@@ -552,7 +554,7 @@ Instead, keeping with its &lt;a 
href="https://docs.rs/datafusion/latest/datafusi
 implementation along with extension points to customize behavior.&lt;/p&gt;
 &lt;p&gt;Specifically, DataFusion includes&lt;/p&gt;
 &lt;ol&gt;
-&lt;li&gt;&amp;ldquo;Syntactic Optimizer&amp;rdquo; (joins in the order they 
are listed in the query[^8]) with basic join re-ordering (&lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;source&lt;/a&gt;)
 to prevent join disasters.&lt;/li&gt;
+&lt;li&gt;&amp;ldquo;Syntactic Optimizer&amp;rdquo; (joins in the order they 
are listed in the query&lt;sup id="fn8"&gt;&lt;a 
href="#footnote8"&gt;8&lt;/a&gt;) with basic join re-ordering (&lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;source&lt;/a&gt;)
 to prevent join disasters.&lt;/sup&gt;&lt;/li&gt;
 &lt;li&gt;Support for &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html"&gt;ColumnStatistics&lt;/a&gt;
 and &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html"&gt;Table
 Statistics&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;The framework for &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity"&gt;filter
 selectivity&lt;/a&gt; + join cardinality estimation.&lt;/li&gt;
 &lt;li&gt;APIs for easily rewriting plans, such as the &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview"&gt;TreeNode
 API&lt;/a&gt; and &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs"&gt;reordering
 joins&lt;/a&gt;&lt;/li&gt;
@@ -592,8 +594,8 @@ learning more about how they are designed and implemented, 
please &lt;a href="ht
 community&lt;/a&gt;. We welcome first time contributors as well as long time 
participants
 to the fun of building a database together.&lt;/p&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
-&lt;p&gt;[^7]: See &lt;a 
href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf"&gt;Unnesting
 Arbitrary Queries&lt;/a&gt; from Neumann and Kemper for a more academic 
treatment.&lt;/p&gt;
-&lt;p&gt;[^8]: One of my favorite terms I learned from Andy Pavlo&amp;rsquo;s 
CMU online lectures&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 
Release</title><link 
href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0"; 
rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-05-06:/blo
 [...]
+&lt;p id="footnote7"&gt;&lt;sup&gt;[7]&lt;/sup&gt; See [Unnesting Arbitrary 
Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf)
 from Neumann and Kemper for a more academic treatment.&lt;/p&gt;
+&lt;p id="footnote8"&gt;&lt;sup&gt;[8]&lt;/sup&gt; One of my favorite terms I 
learned from Andy Pavlo&amp;rsquo;s CMU online 
lectures&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 
Release</title><link 
href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0"; 
rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:d
 [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml
index 54d2e3f..50e6c1f 100644
--- a/blog/feeds/blog.atom.xml
+++ b/blog/feeds/blog.atom.xml
@@ -53,7 +53,7 @@ Pavlo, or some behind-the-scenes player. We believe this 
perception is because:&
 &lt;li&gt;
 &lt;p&gt;One must implement the rest of a database system (data storage, 
transactions,
    SQL parser, expression evaluation, plan execution, etc.) 
&lt;strong&gt;before&lt;/strong&gt; the
-   optimizer becomes critical[^5].&lt;/p&gt;
+   optimizer becomes critical&lt;sup id="fn5"&gt;&lt;a 
href="#footnote5"&gt;5&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;
 &lt;/li&gt;
 &lt;li&gt;
 &lt;p&gt;Some parts of the optimizer are tightly tied to the rest of the 
system (e.g.,
@@ -150,11 +150,11 @@ choosing specific aggregation algorithms.&lt;/p&gt;
 &lt;h2&gt;Query Optimizer Implementation&lt;/h2&gt;
 &lt;p&gt;Industrial optimizers, such as 
 DataFusion&amp;rsquo;s (&lt;a 
href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src"&gt;source&lt;/a&gt;),
-ClickHouse (&lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes"&gt;source&lt;/a&gt;,&lt;a
 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations"&gt;source&lt;/a&gt;),
+ClickHouse (&lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes"&gt;source&lt;/a&gt;,
 &lt;a 
href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations"&gt;source&lt;/a&gt;),
 DuckDB (&lt;a 
href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer"&gt;source&lt;/a&gt;),
 and Apache Spark (&lt;a 
href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer"&gt;source&lt;/a&gt;),
 are implemented as a series of passes or rules that rewrite a query plan. The
-overall optimizer is composed of a sequence of these rules,[^6] as shown in
+overall optimizer is composed of a sequence of these rules,&lt;sup 
id="fn6"&gt;&lt;a href="#footnote6"&gt;6&lt;/a&gt;&lt;/sup&gt; as shown in
 Figure 4. The specific order of the rules also often matters, but we will not
 discuss this detail in this post.&lt;/p&gt;
 &lt;p&gt;A multi-pass design is standard because it helps:&lt;/p&gt;
@@ -199,8 +199,8 @@ DataFusion&lt;/a&gt; PMC member. A Database Optimizer
 connoisseur, he worked on the &lt;a 
href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf"&gt;Vertica 
Analytic
 Database&lt;/a&gt; Query
 Optimizer for six years, has several granted US patents related to query
-optimization[^1], co-authored several papers[^2]  about the topic (including in
-VLDB 2024&lt;a 
href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)"&gt;^3&lt;/a&gt;),
 and spent several weeks&lt;a 
href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]";
 
title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111
 [...]
+optimization&lt;sup id="fn1"&gt;&lt;a 
href="#footnote1"&gt;1&lt;/a&gt;&lt;/sup&gt;, co-authored several papers&lt;sup 
id="fn2"&gt;&lt;a href="#footnote2"&gt;2&lt;/a&gt;&lt;/sup&gt;  about the topic 
(including in
+VLDB 2024&lt;sup id="fn3"&gt;&lt;a 
href="#footnote3"&gt;3&lt;/a&gt;&lt;/sup&gt;), and spent several weeks&lt;sup 
id="fn4"&gt;&lt;a href="#footnote4"&gt;4&lt;/a&gt;&lt;/sup&gt; deeply geeking 
out about this topic
 with other experts (thank you Dagstuhl).&lt;/p&gt;
 &lt;p&gt;&lt;a href="https://www.linkedin.com/in/akurmustafa/"&gt;Mustafa 
Akur&lt;/a&gt; is a PhD Student at
 &lt;a href="https://www.ohsu.edu/"&gt;OHSU&lt;/a&gt; Knight Cancer Institute 
and an &lt;a href="https://datafusion.apache.org/"&gt;Apache
@@ -209,10 +209,12 @@ Software Developer at &lt;a 
href="https://www.synnada.ai/"&gt;Synnada&lt;/a&gt;
 significant features to the DataFusion optimizer, including many &lt;a 
href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/"&gt;sort-based
 optimizations&lt;/a&gt;.&lt;/p&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
-&lt;p&gt;[^1]: &lt;em&gt;Modular Query Optimizer, US 8,312,027 &amp;middot; 
Issued Nov 13, 2012&lt;/em&gt;, Query Optimizer with schema conversion US 
8,086,598 &amp;middot; Issued Dec 27, 2011&lt;/p&gt;
-&lt;p&gt;[^2]: &lt;a 
href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers"&gt;The
 Vertica Query Optimizer: The case for specialized Query 
Optimizers&lt;/a&gt;&lt;/p&gt;
-&lt;p&gt;[^5]: And thus in academic classes, by the time you get around to an 
optimizer the semester is over and everyone is ready for the semester to be 
done. Once industrial systems mature to the point where the optimizer is a 
bottleneck, the shiny new-ness of the&lt;a 
href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"&gt; hype 
cycle&lt;/a&gt; has worn off and it is likely in the trough of disappointment.  
&lt;/p&gt;
-&lt;p&gt;[^6]: Often systems will classify these passes into different 
categories, but I am simplifying here&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in 
DataFusion, Part 2: Optimizers in Apache DataFusion</title><link 
href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two";
 
rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-06-15T00:00:00+00:00</updated><autho
 [...]
+&lt;p id="footnote1"&gt;&lt;sup&gt;[1]&lt;/sup&gt; *Modular Query Optimizer, 
US 8,312,027 &amp;middot; Issued Nov 13, 2012*, Query Optimizer with schema 
conversion US 8,086,598 &amp;middot; Issued Dec 27, 2011&lt;/p&gt;
+&lt;p id="footnote2"&gt;&lt;sup&gt;[2]&lt;/sup&gt; [The Vertica Query 
Optimizer: The case for specialized Query 
Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)&lt;/p&gt;
+&lt;p id="footnote3"&gt;&lt;sup&gt;[3]&lt;/sup&gt; 
[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)&lt;/p&gt;
+&lt;p id="footnote4"&gt;&lt;sup&gt;[4]&lt;/sup&gt; 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101)
 , 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111)
 
[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/sem
 [...]
+&lt;p id="footnote5"&gt;&lt;sup&gt;[5]&lt;/sup&gt;  And thus in academic 
classes, by the time you get around to an optimizer the semester is over and 
everyone is ready for the semester to be done. Once industrial systems mature 
to the point where the optimizer is a bottleneck, the shiny new-ness of the[ 
hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and 
it is likely in the trough of disappointment.&lt;/p&gt;
+&lt;p id="footnote6"&gt;&lt;sup&gt;[6]&lt;/sup&gt; Often systems will classify 
these passes into different categories, but I am simplifying 
here&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in 
DataFusion, Part 2: Optimizers in Apache DataFusion</title><link 
href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two";
 
rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-
 [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
@@ -388,7 +390,7 @@ support row-at-a-time evaluation given how terrible the 
performance would be.
 Instead, analytic systems rewrite such queries into joins which can perform 
100s
 or 1000s of times faster for large datasets. However, transforming subqueries 
to
 joins requires &amp;ldquo;exotic&amp;rdquo; join semantics such as 
&lt;code&gt;SEMI JOIN&lt;/code&gt;, &lt;code&gt;ANTI JOIN&lt;/code&gt;  and
-variations on how to treat equality with null[^7].&lt;/p&gt;
+variations on how to treat equality with null&lt;sup id="fn7"&gt;&lt;a 
href="#footnote7"&gt;7&lt;/a&gt;.&lt;/sup&gt;&lt;/p&gt;
 &lt;p&gt;For a simple example, consider that a query like this:&lt;/p&gt;
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT customer.name 
 FROM customer 
@@ -520,7 +522,7 @@ potentially (very) different performance. The major options 
in this category are
    estimates their cost and picks the one using the lowest cost.&lt;/p&gt;
 &lt;/li&gt;
 &lt;/ol&gt;
-&lt;p&gt;For some examples, you can read about [Spark&amp;rsquo;s cost-based 
optimizer] or look at
+&lt;p&gt;For some examples, you can read about &lt;a 
href="https://docs.databricks.com/aws/en/optimizations/cbo"&gt;Spark&amp;rsquo;s
 cost-based optimizer&lt;/a&gt; or look at
 the code for &lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;DataFusion&amp;rsquo;s
 join selection&lt;/a&gt; and &lt;a 
href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp"&gt;DuckDB&amp;rsquo;s
 cost model&lt;/a&gt; and &lt;a 
href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472"&gt;join
 order enumeration&lt;/a&gt;.&lt;/p&gt;
 &lt;p&gt;However, the use of heuristics and (imprecise) cost models means 
optimizers must&lt;/p&gt;
@@ -552,7 +554,7 @@ Instead, keeping with its &lt;a 
href="https://docs.rs/datafusion/latest/datafusi
 implementation along with extension points to customize behavior.&lt;/p&gt;
 &lt;p&gt;Specifically, DataFusion includes&lt;/p&gt;
 &lt;ol&gt;
-&lt;li&gt;&amp;ldquo;Syntactic Optimizer&amp;rdquo; (joins in the order they 
are listed in the query[^8]) with basic join re-ordering (&lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;source&lt;/a&gt;)
 to prevent join disasters.&lt;/li&gt;
+&lt;li&gt;&amp;ldquo;Syntactic Optimizer&amp;rdquo; (joins in the order they 
are listed in the query&lt;sup id="fn8"&gt;&lt;a 
href="#footnote8"&gt;8&lt;/a&gt;) with basic join re-ordering (&lt;a 
href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs"&gt;source&lt;/a&gt;)
 to prevent join disasters.&lt;/sup&gt;&lt;/li&gt;
 &lt;li&gt;Support for &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html"&gt;ColumnStatistics&lt;/a&gt;
 and &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html"&gt;Table
 Statistics&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;The framework for &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity"&gt;filter
 selectivity&lt;/a&gt; + join cardinality estimation.&lt;/li&gt;
 &lt;li&gt;APIs for easily rewriting plans, such as the &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview"&gt;TreeNode
 API&lt;/a&gt; and &lt;a 
href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs"&gt;reordering
 joins&lt;/a&gt;&lt;/li&gt;
@@ -592,8 +594,8 @@ learning more about how they are designed and implemented, 
please &lt;a href="ht
 community&lt;/a&gt;. We welcome first time contributors as well as long time 
participants
 to the fun of building a database together.&lt;/p&gt;
 &lt;h2&gt;Notes&lt;/h2&gt;
-&lt;p&gt;[^7]: See &lt;a 
href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf"&gt;Unnesting
 Arbitrary Queries&lt;/a&gt; from Neumann and Kemper for a more academic 
treatment.&lt;/p&gt;
-&lt;p&gt;[^8]: One of my favorite terms I learned from Andy Pavlo&amp;rsquo;s 
CMU online lectures&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 
Release</title><link 
href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0"; 
rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-05-06:/blo
 [...]
+&lt;p id="footnote7"&gt;&lt;sup&gt;[7]&lt;/sup&gt; See [Unnesting Arbitrary 
Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf)
 from Neumann and Kemper for a more academic treatment.&lt;/p&gt;
+&lt;p id="footnote8"&gt;&lt;sup&gt;[8]&lt;/sup&gt; One of my favorite terms I 
learned from Andy Pavlo&amp;rsquo;s CMU online 
lectures&lt;/p&gt;</content><category 
term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 
Release</title><link 
href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0"; 
rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:d
 [...]
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@datafusion.apache.org
For additional commands, e-mail: commits-h...@datafusion.apache.org
