This is an automated email from the ASF dual-hosted git repository. github-bot pushed a commit to branch asf-staging in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-staging by this push:
     new 837cdf7  Commit build products
837cdf7 is described below

commit 837cdf7321be517b803d71adcf8017d3c9ce8298
Author: Build Pelican (action) <priv...@infra.apache.org>
AuthorDate: Tue Jun 10 03:14:34 2025 +0000

    Commit build products
---
 .../optimizing-sql-dataframes-part-one/index.html  | 20 ++++++++-------
 .../optimizing-sql-dataframes-part-two/index.html  | 10 ++++----
 blog/feeds/alamb-akurmustafa.atom.xml              | 30 ++++++++++++----------
 blog/feeds/all-en.atom.xml                         | 30 ++++++++++++----------
 blog/feeds/blog.atom.xml                           | 30 ++++++++++++----------
 5 files changed, 64 insertions(+), 56 deletions(-)

diff --git a/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html b/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html
index 7bc4b1a..93372ba 100644
--- a/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html
+++ b/blog/2025/06/15/optimizing-sql-dataframes-part-one/index.html
@@ -71,7 +71,7 @@ Pavlo, or some behind-the-scenes player. We believe this perception is because:<
 <li>
 <p>One must implement the rest of a database system (data storage, transactions, SQL parser, expression evaluation, plan execution, etc.)
<strong>before</strong> the - optimizer becomes critical[^5].</p> + optimizer becomes critical<sup id="fn5"><a href="#footnote5">5</a></sup>.</p> </li> <li> <p>Some parts of the optimizer are tightly tied to the rest of the system (e.g., @@ -168,11 +168,11 @@ choosing specific aggregation algorithms.</p> <h2>Query Optimizer Implementation</h2> <p>Industrial optimizers, such as DataFusion’s (<a href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src">source</a>), -ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>,<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), +ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>, <a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), DuckDB (<a href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer">source</a>), and Apache Spark (<a href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer">source</a>), are implemented as a series of passes or rules that rewrite a query plan. The -overall optimizer is composed of a sequence of these rules,[^6] as shown in +overall optimizer is composed of a sequence of these rules,<sup id="fn6"><a href="#footnote6">6</a></sup> as shown in Figure 4. The specific order of the rules also often matters, but we will not discuss this detail in this post.</p> <p>A multi-pass design is standard because it helps:</p> @@ -217,8 +217,8 @@ DataFusion</a> PMC member. 
A Database Optimizer connoisseur, he worked on the <a href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">Vertica Analytic Database</a> Query Optimizer for six years, has several granted US patents related to query -optimization[^1], co-authored several papers[^2] about the topic (including in -VLDB 2024<a href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)">^3</a>), and spent several weeks<a href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]" title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111) [https://www. [...] +optimization<sup id="fn1"><a href="#footnote1">1</a></sup>, co-authored several papers<sup id="fn2"><a href="#footnote2">2</a></sup> about the topic (including in +VLDB 2024<sup id="fn3"><a href="#footnote3">3</a></sup>), and spent several weeks<sup id="fn4"><a href="#footnote4">4</a></sup> deeply geeking out about this topic with other experts (thank you Dagstuhl).</p> <p><a href="https://www.linkedin.com/in/akurmustafa/">Mustafa Akur</a> is a PhD Student at <a href="https://www.ohsu.edu/">OHSU</a> Knight Cancer Institute and an <a href="https://datafusion.apache.org/">Apache @@ -227,10 +227,12 @@ Software Developer at <a href="https://www.synnada.ai/">Synnada</a> where he con significant features to the DataFusion optimizer, including many <a href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/">sort-based optimizations</a>.</p> <h2>Notes</h2> -<p>[^1]: <em>Modular Query Optimizer, US 8,312,027 · Issued Nov 13, 2012</em>, Query Optimizer with schema conversion US 8,086,598 · Issued Dec 27, 2011</p> -<p>[^2]: <a href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers">The Vertica 
Query Optimizer: The case for specialized Query Optimizers</a></p> -<p>[^5]: And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the<a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"> hype cycle</a> has worn off and it is likely in the trough of disappointment. </p> -<p>[^6]: Often systems will classify these passes into different categories, but I am simplifying here</p> +<p id="footnote1"><sup>[1]</sup> *Modular Query Optimizer, US 8,312,027 · Issued Nov 13, 2012*, Query Optimizer with schema conversion US 8,086,598 · Issued Dec 27, 2011</p> +<p id="footnote2"><sup>[2]</sup> [The Vertica Query Optimizer: The case for specialized Query Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)</p> +<p id="footnote3"><sup>[3]</sup> [https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)</p> +<p id="footnote4"><sup>[4]</sup> [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111) [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321)</p> +<p id="footnote5"><sup>[5]</sup> And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. 
Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the[ hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and it is likely in the trough of disappointment.</p> +<p id="footnote6"><sup>[6]</sup> Often systems will classify these passes into different categories, but I am simplifying here</p> </div> </div> </div> diff --git a/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html b/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html index 2235435..15a1668 100644 --- a/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html +++ b/blog/2025/06/15/optimizing-sql-dataframes-part-two/index.html @@ -197,7 +197,7 @@ support row-at-a-time evaluation given how terrible the performance would be. Instead, analytic systems rewrite such queries into joins which can perform 100s or 1000s of times faster for large datasets. However, transforming subqueries to joins requires “exotic” join semantics such as <code>SEMI JOIN</code>, <code>ANTI JOIN</code> and -variations on how to treat equality with null[^7].</p> +variations on how to treat equality with null<sup id="fn7"><a href="#footnote7">7</a>.</sup></p> <p>For a simple example, consider that a query like this:</p> <pre><code class="language-sql">SELECT customer.name FROM customer @@ -329,7 +329,7 @@ potentially (very) different performance. 
The major options in this category are estimates their cost and picks the one using the lowest cost.</p> </li> </ol> -<p>For some examples, you can read about [Spark’s cost-based optimizer] or look at +<p>For some examples, you can read about <a href="https://docs.databricks.com/aws/en/optimizations/cbo">Spark’s cost-based optimizer</a> or look at the code for <a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">DataFusion’s join selection</a> and <a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp">DuckDB’s cost model</a> and <a href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472">join order enumeration</a>.</p> <p>However, the use of heuristics and (imprecise) cost models means optimizers must</p> @@ -361,7 +361,7 @@ Instead, keeping with its <a href="https://docs.rs/datafusion/latest/datafusion/ implementation along with extension points to customize behavior.</p> <p>Specifically, DataFusion includes</p> <ol> -<li>“Syntactic Optimizer” (joins in the order they are listed in the query[^8]) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</li> +<li>“Syntactic Optimizer” (joins in the order they are listed in the query<sup id="fn8"><a href="#footnote8">8</a>) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</sup></li> <li>Support for <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html">ColumnStatistics</a> and <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html">Table Statistics</a></li> <li>The framework for <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity">filter selectivity</a> + join cardinality estimation.</li> <li>APIs for easily rewriting plans, such as the <a href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview">TreeNode API</a> and <a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs">reordering joins</a></li> @@ -401,8 +401,8 @@ learning more about how they are designed and implemented, please <a href="https community</a>. We welcome first time contributors as well as long time participants to the fun of building a database together.</p> <h2>Notes</h2> -<p>[^7]: See <a href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf">Unnesting Arbitrary Queries</a> from Neumann and Kemper for a more academic treatment.</p> -<p>[^8]: One of my favorite terms I learned from Andy Pavlo’s CMU online lectures</p> +<p id="footnote7"><sup>[7]</sup> See [Unnesting Arbitrary Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf) from Neumann and Kemper for a more academic treatment.</p> +<p id="footnote8"><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo’s CMU online lectures</p> </div> </div> </div> diff --git a/blog/feeds/alamb-akurmustafa.atom.xml b/blog/feeds/alamb-akurmustafa.atom.xml index d4a265f..2d45f81 100644 --- a/blog/feeds/alamb-akurmustafa.atom.xml +++ b/blog/feeds/alamb-akurmustafa.atom.xml @@ -53,7 +53,7 @@ Pavlo, or some behind-the-scenes player. We believe this perception is because:& <li> <p>One must implement the rest of a database system (data storage, transactions, SQL parser, expression evaluation, plan execution, etc.) 
<strong>before</strong> the - optimizer becomes critical[^5].</p> + optimizer becomes critical<sup id="fn5"><a href="#footnote5">5</a></sup>.</p> </li> <li> <p>Some parts of the optimizer are tightly tied to the rest of the system (e.g., @@ -150,11 +150,11 @@ choosing specific aggregation algorithms.</p> <h2>Query Optimizer Implementation</h2> <p>Industrial optimizers, such as DataFusion&rsquo;s (<a href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src">source</a>), -ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>,<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), +ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>, <a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), DuckDB (<a href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer">source</a>), and Apache Spark (<a href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer">source</a>), are implemented as a series of passes or rules that rewrite a query plan. The -overall optimizer is composed of a sequence of these rules,[^6] as shown in +overall optimizer is composed of a sequence of these rules,<sup id="fn6"><a href="#footnote6">6</a></sup> as shown in Figure 4. The specific order of the rules also often matters, but we will not discuss this detail in this post.</p> <p>A multi-pass design is standard because it helps:</p> @@ -199,8 +199,8 @@ DataFusion</a> PMC member. 
A Database Optimizer connoisseur, he worked on the <a href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">Vertica Analytic Database</a> Query Optimizer for six years, has several granted US patents related to query -optimization[^1], co-authored several papers[^2] about the topic (including in -VLDB 2024<a href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)">^3</a>), and spent several weeks<a href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]" title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111 [...] +optimization<sup id="fn1"><a href="#footnote1">1</a></sup>, co-authored several papers<sup id="fn2"><a href="#footnote2">2</a></sup> about the topic (including in +VLDB 2024<sup id="fn3"><a href="#footnote3">3</a></sup>), and spent several weeks<sup id="fn4"><a href="#footnote4">4</a></sup> deeply geeking out about this topic with other experts (thank you Dagstuhl).</p> <p><a href="https://www.linkedin.com/in/akurmustafa/">Mustafa Akur</a> is a PhD Student at <a href="https://www.ohsu.edu/">OHSU</a> Knight Cancer Institute and an <a href="https://datafusion.apache.org/">Apache @@ -209,10 +209,12 @@ Software Developer at <a href="https://www.synnada.ai/">Synnada</a> significant features to the DataFusion optimizer, including many <a href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/">sort-based optimizations</a>.</p> <h2>Notes</h2> -<p>[^1]: <em>Modular Query Optimizer, US 8,312,027 &middot; Issued Nov 13, 2012</em>, Query Optimizer with schema conversion US 8,086,598 &middot; Issued Dec 27, 2011</p> -<p>[^2]: <a href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers">The Vertica Query Optimizer: 
The case for specialized Query Optimizers</a></p> -<p>[^5]: And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the<a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"> hype cycle</a> has worn off and it is likely in the trough of disappointment. </p> -<p>[^6]: Often systems will classify these passes into different categories, but I am simplifying here</p></content><category term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion</title><link href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two" rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-06-15T00:00:00+00:00</updated><autho [...] +<p id="footnote1"><sup>[1]</sup> *Modular Query Optimizer, US 8,312,027 &middot; Issued Nov 13, 2012*, Query Optimizer with schema conversion US 8,086,598 &middot; Issued Dec 27, 2011</p> +<p id="footnote2"><sup>[2]</sup> [The Vertica Query Optimizer: The case for specialized Query Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)</p> +<p id="footnote3"><sup>[3]</sup> [https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)</p> +<p id="footnote4"><sup>[4]</sup> [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111) [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/sem [...] 
+<p id="footnote5"><sup>[5]</sup> And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the[ hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and it is likely in the trough of disappointment.</p> +<p id="footnote6"><sup>[6]</sup> Often systems will classify these passes into different categories, but I am simplifying here</p></content><category term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion</title><link href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two" rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025- [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -388,7 +390,7 @@ support row-at-a-time evaluation given how terrible the performance would be. Instead, analytic systems rewrite such queries into joins which can perform 100s or 1000s of times faster for large datasets. However, transforming subqueries to joins requires &ldquo;exotic&rdquo; join semantics such as <code>SEMI JOIN</code>, <code>ANTI JOIN</code> and -variations on how to treat equality with null[^7].</p> +variations on how to treat equality with null<sup id="fn7"><a href="#footnote7">7</a>.</sup></p> <p>For a simple example, consider that a query like this:</p> <pre><code class="language-sql">SELECT customer.name FROM customer @@ -520,7 +522,7 @@ potentially (very) different performance. 
The major options in this category are estimates their cost and picks the one using the lowest cost.</p> </li> </ol> -<p>For some examples, you can read about [Spark&rsquo;s cost-based optimizer] or look at +<p>For some examples, you can read about <a href="https://docs.databricks.com/aws/en/optimizations/cbo">Spark&rsquo;s cost-based optimizer</a> or look at the code for <a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">DataFusion&rsquo;s join selection</a> and <a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp">DuckDB&rsquo;s cost model</a> and <a href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472">join order enumeration</a>.</p> <p>However, the use of heuristics and (imprecise) cost models means optimizers must</p> @@ -552,7 +554,7 @@ Instead, keeping with its <a href="https://docs.rs/datafusion/latest/datafusi implementation along with extension points to customize behavior.</p> <p>Specifically, DataFusion includes</p> <ol> -<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in the query[^8]) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</li> +<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in the query<sup id="fn8"><a href="#footnote8">8</a>) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</sup></li> <li>Support for <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html">ColumnStatistics</a> and <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html">Table Statistics</a></li> <li>The framework for <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity">filter selectivity</a> + join cardinality estimation.</li> <li>APIs for easily rewriting plans, such as the <a href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview">TreeNode API</a> and <a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs">reordering joins</a></li> @@ -592,5 +594,5 @@ learning more about how they are designed and implemented, please <a href="ht community</a>. We welcome first time contributors as well as long time participants to the fun of building a database together.</p> <h2>Notes</h2> -<p>[^7]: See <a href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf">Unnesting Arbitrary Queries</a> from Neumann and Kemper for a more academic treatment.</p> -<p>[^8]: One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry></feed> \ No newline at end of file +<p id="footnote7"><sup>[7]</sup> See [Unnesting Arbitrary Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf) from Neumann and Kemper for a more academic treatment.</p> +<p id="footnote8"><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry></feed> \ No newline at end of file diff --git a/blog/feeds/all-en.atom.xml b/blog/feeds/all-en.atom.xml index ddf2237..cca84e1 100644 --- a/blog/feeds/all-en.atom.xml +++ b/blog/feeds/all-en.atom.xml @@ -53,7 +53,7 @@ Pavlo, or some behind-the-scenes player. We believe this perception is because:& <li> <p>One must implement the rest of a database system (data storage, transactions, SQL parser, expression evaluation, plan execution, etc.) 
<strong>before</strong> the - optimizer becomes critical[^5].</p> + optimizer becomes critical<sup id="fn5"><a href="#footnote5">5</a></sup>.</p> </li> <li> <p>Some parts of the optimizer are tightly tied to the rest of the system (e.g., @@ -150,11 +150,11 @@ choosing specific aggregation algorithms.</p> <h2>Query Optimizer Implementation</h2> <p>Industrial optimizers, such as DataFusion&rsquo;s (<a href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src">source</a>), -ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>,<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), +ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>, <a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), DuckDB (<a href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer">source</a>), and Apache Spark (<a href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer">source</a>), are implemented as a series of passes or rules that rewrite a query plan. The -overall optimizer is composed of a sequence of these rules,[^6] as shown in +overall optimizer is composed of a sequence of these rules,<sup id="fn6"><a href="#footnote6">6</a></sup> as shown in Figure 4. The specific order of the rules also often matters, but we will not discuss this detail in this post.</p> <p>A multi-pass design is standard because it helps:</p> @@ -199,8 +199,8 @@ DataFusion</a> PMC member. 
A Database Optimizer connoisseur, he worked on the <a href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">Vertica Analytic Database</a> Query Optimizer for six years, has several granted US patents related to query -optimization[^1], co-authored several papers[^2] about the topic (including in -VLDB 2024<a href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)">^3</a>), and spent several weeks<a href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]" title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111 [...] +optimization<sup id="fn1"><a href="#footnote1">1</a></sup>, co-authored several papers<sup id="fn2"><a href="#footnote2">2</a></sup> about the topic (including in +VLDB 2024<sup id="fn3"><a href="#footnote3">3</a></sup>), and spent several weeks<sup id="fn4"><a href="#footnote4">4</a></sup> deeply geeking out about this topic with other experts (thank you Dagstuhl).</p> <p><a href="https://www.linkedin.com/in/akurmustafa/">Mustafa Akur</a> is a PhD Student at <a href="https://www.ohsu.edu/">OHSU</a> Knight Cancer Institute and an <a href="https://datafusion.apache.org/">Apache @@ -209,10 +209,12 @@ Software Developer at <a href="https://www.synnada.ai/">Synnada</a> significant features to the DataFusion optimizer, including many <a href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/">sort-based optimizations</a>.</p> <h2>Notes</h2> -<p>[^1]: <em>Modular Query Optimizer, US 8,312,027 &middot; Issued Nov 13, 2012</em>, Query Optimizer with schema conversion US 8,086,598 &middot; Issued Dec 27, 2011</p> -<p>[^2]: <a href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers">The Vertica Query Optimizer: 
The case for specialized Query Optimizers</a></p> -<p>[^5]: And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the<a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"> hype cycle</a> has worn off and it is likely in the trough of disappointment. </p> -<p>[^6]: Often systems will classify these passes into different categories, but I am simplifying here</p></content><category term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion</title><link href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two" rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-06-15T00:00:00+00:00</updated><autho [...] +<p id="footnote1"><sup>[1]</sup> *Modular Query Optimizer, US 8,312,027 &middot; Issued Nov 13, 2012*, Query Optimizer with schema conversion US 8,086,598 &middot; Issued Dec 27, 2011</p> +<p id="footnote2"><sup>[2]</sup> [The Vertica Query Optimizer: The case for specialized Query Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)</p> +<p id="footnote3"><sup>[3]</sup> [https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)</p> +<p id="footnote4"><sup>[4]</sup> [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111) [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/sem [...] 
+<p id="footnote5"><sup>[5]</sup> And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the[ hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and it is likely in the trough of disappointment.</p> +<p id="footnote6"><sup>[6]</sup> Often systems will classify these passes into different categories, but I am simplifying here</p></content><category term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion</title><link href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two" rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025- [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -388,7 +390,7 @@ support row-at-a-time evaluation given how terrible the performance would be. Instead, analytic systems rewrite such queries into joins which can perform 100s or 1000s of times faster for large datasets. However, transforming subqueries to joins requires &ldquo;exotic&rdquo; join semantics such as <code>SEMI JOIN</code>, <code>ANTI JOIN</code> and -variations on how to treat equality with null[^7].</p> +variations on how to treat equality with null<sup id="fn7"><a href="#footnote7">7</a>.</sup></p> <p>For a simple example, consider that a query like this:</p> <pre><code class="language-sql">SELECT customer.name FROM customer @@ -520,7 +522,7 @@ potentially (very) different performance. 
The major options in this category are estimates their cost and picks the one using the lowest cost.</p> </li> </ol> -<p>For some examples, you can read about [Spark&rsquo;s cost-based optimizer] or look at +<p>For some examples, you can read about <a href="https://docs.databricks.com/aws/en/optimizations/cbo">Spark&rsquo;s cost-based optimizer</a> or look at the code for <a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">DataFusion&rsquo;s join selection</a> and <a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp">DuckDB&rsquo;s cost model</a> and <a href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472">join order enumeration</a>.</p> <p>However, the use of heuristics and (imprecise) cost models means optimizers must</p> @@ -552,7 +554,7 @@ Instead, keeping with its <a href="https://docs.rs/datafusion/latest/datafusi implementation along with extension points to customize behavior.</p> <p>Specifically, DataFusion includes</p> <ol> -<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in the query[^8]) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</li> +<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in the query<sup id="fn8"><a href="#footnote8">8</a>) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</sup></li> <li>Support for <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html">ColumnStatistics</a> and <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html">Table Statistics</a></li> <li>The framework for <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity">filter selectivity</a> + join cardinality estimation.</li> <li>APIs for easily rewriting plans, such as the <a href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview">TreeNode API</a> and <a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs">reordering joins</a></li> @@ -592,8 +594,8 @@ learning more about how they are designed and implemented, please <a href="ht community</a>. We welcome first time contributors as well as long time participants to the fun of building a database together.</p> <h2>Notes</h2> -<p>[^7]: See <a href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf">Unnesting Arbitrary Queries</a> from Neumann and Kemper for a more academic treatment.</p> -<p>[^8]: One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-05-06:/blo [...] 
+<p id="footnote7"><sup>[7]</sup> See [Unnesting Arbitrary Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf) from Neumann and Kemper for a more academic treatment.</p> +<p id="footnote8"><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:d [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with diff --git a/blog/feeds/blog.atom.xml b/blog/feeds/blog.atom.xml index 54d2e3f..50e6c1f 100644 --- a/blog/feeds/blog.atom.xml +++ b/blog/feeds/blog.atom.xml @@ -53,7 +53,7 @@ Pavlo, or some behind-the-scenes player. We believe this perception is because:& <li> <p>One must implement the rest of a database system (data storage, transactions, SQL parser, expression evaluation, plan execution, etc.) 
<strong>before</strong> the - optimizer becomes critical[^5].</p> + optimizer becomes critical<sup id="fn5"><a href="#footnote5">5</a></sup>.</p> </li> <li> <p>Some parts of the optimizer are tightly tied to the rest of the system (e.g., @@ -150,11 +150,11 @@ choosing specific aggregation algorithms.</p> <h2>Query Optimizer Implementation</h2> <p>Industrial optimizers, such as DataFusion&rsquo;s (<a href="https://github.com/apache/datafusion/tree/334d6ec50f36659403c96e1bffef4228be7c458e/datafusion/optimizer/src">source</a>), -ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>,<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), +ClickHouse (<a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Analyzer/Passes">source</a>, <a href="https://github.com/ClickHouse/ClickHouse/tree/master/src/Processors/QueryPlan/Optimizations">source</a>), DuckDB (<a href="https://github.com/duckdb/duckdb/tree/4afa85c6a4dacc39524d1649fd8eb8c19c28ad14/src/optimizer">source</a>), and Apache Spark (<a href="https://github.com/apache/spark/tree/7bc8e99cde424c59b98fe915e3fdaaa30beadb76/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer">source</a>), are implemented as a series of passes or rules that rewrite a query plan. The -overall optimizer is composed of a sequence of these rules,[^6] as shown in +overall optimizer is composed of a sequence of these rules,<sup id="fn6"><a href="#footnote6">6</a></sup> as shown in Figure 4. The specific order of the rules also often matters, but we will not discuss this detail in this post.</p> <p>A multi-pass design is standard because it helps:</p> @@ -199,8 +199,8 @@ DataFusion</a> PMC member. 
A Database Optimizer connoisseur, he worked on the <a href="https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf">Vertica Analytic Database</a> Query Optimizer for six years, has several granted US patents related to query -optimization[^1], co-authored several papers[^2] about the topic (including in -VLDB 2024<a href="[https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)">^3</a>), and spent several weeks<a href="[https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101]" title="https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111 [...] +optimization<sup id="fn1"><a href="#footnote1">1</a></sup>, co-authored several papers<sup id="fn2"><a href="#footnote2">2</a></sup> about the topic (including in +VLDB 2024<sup id="fn3"><a href="#footnote3">3</a></sup>), and spent several weeks<sup id="fn4"><a href="#footnote4">4</a></sup> deeply geeking out about this topic with other experts (thank you Dagstuhl).</p> <p><a href="https://www.linkedin.com/in/akurmustafa/">Mustafa Akur</a> is a PhD Student at <a href="https://www.ohsu.edu/">OHSU</a> Knight Cancer Institute and an <a href="https://datafusion.apache.org/">Apache @@ -209,10 +209,12 @@ Software Developer at <a href="https://www.synnada.ai/">Synnada</a> significant features to the DataFusion optimizer, including many <a href="https://datafusion.apache.org/blog/2025/03/11/ordering-analysis/">sort-based optimizations</a>.</p> <h2>Notes</h2> -<p>[^1]: <em>Modular Query Optimizer, US 8,312,027 &middot; Issued Nov 13, 2012</em>, Query Optimizer with schema conversion US 8,086,598 &middot; Issued Dec 27, 2011</p> -<p>[^2]: <a href="https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers">The Vertica Query Optimizer: 
The case for specialized Query Optimizers</a></p> -<p>[^5]: And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the<a href="https://en.wikipedia.org/wiki/Gartner_hype_cycle"> hype cycle</a> has worn off and it is likely in the trough of disappointment. </p> -<p>[^6]: Often systems will classify these passes into different categories, but I am simplifying here</p></content><category term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion</title><link href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two" rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025-06-15T00:00:00+00:00</updated><autho [...] +<p id="footnote1"><sup>[1]</sup> *Modular Query Optimizer, US 8,312,027 &middot; Issued Nov 13, 2012*, Query Optimizer with schema conversion US 8,086,598 &middot; Issued Dec 27, 2011</p> +<p id="footnote2"><sup>[2]</sup> [The Vertica Query Optimizer: The case for specialized Query Optimizers](https://www.researchgate.net/publication/269306314_The_Vertica_Query_Optimizer_The_case_for_specialized_query_optimizers)</p> +<p id="footnote3"><sup>[3]</sup> [https://www.vldb.org/pvldb/vol17/p1350-justen.pdf](https://www.vldb.org/pvldb/vol17/p1350-justen.pdf)</p> +<p id="footnote4"><sup>[4]</sup> [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/24101) , [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111](https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/22111) [https://www.dagstuhl.de/en/seminars/seminar-calendar/seminar-details/12321](https://www.dagstuhl.de/en/seminars/seminar-calendar/sem [...] 
+<p id="footnote5"><sup>[5]</sup> And thus in academic classes, by the time you get around to an optimizer the semester is over and everyone is ready for the semester to be done. Once industrial systems mature to the point where the optimizer is a bottleneck, the shiny new-ness of the[ hype cycle](https://en.wikipedia.org/wiki/Gartner_hype_cycle) has worn off and it is likely in the trough of disappointment.</p> +<p id="footnote6"><sup>[6]</sup> Often systems will classify these passes into different categories, but I am simplifying here</p></content><category term="blog"></category></entry><entry><title>Optimizing SQL (and DataFrames) in DataFusion, Part 2: Optimizers in Apache DataFusion</title><link href="https://datafusion.apache.org/blog/2025/06/15/optimizing-sql-dataframes-part-two" rel="alternate"></link><published>2025-06-15T00:00:00+00:00</published><updated>2025- [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -388,7 +390,7 @@ support row-at-a-time evaluation given how terrible the performance would be. Instead, analytic systems rewrite such queries into joins which can perform 100s or 1000s of times faster for large datasets. However, transforming subqueries to joins requires &ldquo;exotic&rdquo; join semantics such as <code>SEMI JOIN</code>, <code>ANTI JOIN</code> and -variations on how to treat equality with null[^7].</p> +variations on how to treat equality with null<sup id="fn7"><a href="#footnote7">7</a>.</sup></p> <p>For a simple example, consider that a query like this:</p> <pre><code class="language-sql">SELECT customer.name FROM customer @@ -520,7 +522,7 @@ potentially (very) different performance. 
The major options in this category are estimates their cost and picks the one using the lowest cost.</p> </li> </ol> -<p>For some examples, you can read about [Spark&rsquo;s cost-based optimizer] or look at +<p>For some examples, you can read about <a href="https://docs.databricks.com/aws/en/optimizations/cbo">Spark&rsquo;s cost-based optimizer</a> or look at the code for <a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">DataFusion&rsquo;s join selection</a> and <a href="https://github.com/duckdb/duckdb/blob/main/src/optimizer/join_order/cost_model.cpp">DuckDB&rsquo;s cost model</a> and <a href="https://github.com/duckdb/duckdb/blob/84c87b12fa9554a8775dc243b4d0afd5b407321a/src/optimizer/join_order/plan_enumerator.cpp#L469-L472">join order enumeration</a>.</p> <p>However, the use of heuristics and (imprecise) cost models means optimizers must</p> @@ -552,7 +554,7 @@ Instead, keeping with its <a href="https://docs.rs/datafusion/latest/datafusi implementation along with extension points to customize behavior.</p> <p>Specifically, DataFusion includes</p> <ol> -<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in the query[^8]) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</li> +<li>&ldquo;Syntactic Optimizer&rdquo; (joins in the order they are listed in the query<sup id="fn8"><a href="#footnote8">8</a>) with basic join re-ordering (<a href="https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/join_selection.rs">source</a>) to prevent join disasters.</sup></li> <li>Support for <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.ColumnStatistics.html">ColumnStatistics</a> and <a href="https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html">Table Statistics</a></li> <li>The framework for <a 
href="https://docs.rs/datafusion/latest/datafusion/physical_expr/struct.AnalysisContext.html#structfield.selectivity">filter selectivity</a> + join cardinality estimation.</li> <li>APIs for easily rewriting plans, such as the <a href="https://docs.rs/datafusion/latest/datafusion/common/tree_node/trait.TreeNode.html#overview">TreeNode API</a> and <a href="https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.HashJoinExec.html#method.swap_inputs">reordering joins</a></li> @@ -592,8 +594,8 @@ learning more about how they are designed and implemented, please <a href="ht community</a>. We welcome first time contributors as well as long time participants to the fun of building a database together.</p> <h2>Notes</h2> -<p>[^7]: See <a href="https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf">Unnesting Arbitrary Queries</a> from Neumann and Kemper for a more academic treatment.</p> -<p>[^8]: One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:datafusion.apache.org,2025-05-06:/blo [...] 
+<p id="footnote7"><sup>[7]</sup> See [Unnesting Arbitrary Queries](https://btw-2015.informatik.uni-hamburg.de/res/proceedings/Hauptband/Wiss/Neumann-Unnesting_Arbitrary_Querie.pdf) from Neumann and Kemper for a more academic treatment.</p> +<p id="footnote8"><sup>[8]</sup> One of my favorite terms I learned from Andy Pavlo&rsquo;s CMU online lectures</p></content><category term="blog"></category></entry><entry><title>Apache DataFusion Comet 0.8.0 Release</title><link href="https://datafusion.apache.org/blog/2025/05/06/datafusion-comet-0.8.0" rel="alternate"></link><published>2025-05-06T00:00:00+00:00</published><updated>2025-05-06T00:00:00+00:00</updated><author><name>pmc</name></author><id>tag:d [...] {% comment %} Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with