This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/datafusion-site.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 4f92110 Commit build products
4f92110 is described below
commit 4f92110d2ffe36d027e2d9360b072a9f92d4d2d8
Author: Build Pelican (action) <[email protected]>
AuthorDate: Wed Mar 19 18:14:36 2025 +0000
Commit build products
---
output/2025/03/11/ordering-analysis/index.html | 42 ++++++++++++++++++++-
output/feeds/all-en.atom.xml | 42 ++++++++++++++++++++-
output/feeds/blog.atom.xml | 42 ++++++++++++++++++++-
output/feeds/mustafa-akur-andrew-lamb.atom.xml | 42 ++++++++++++++++++++-
.../images/ordering_analysis/query_window_plan.png | Bin 0 -> 189377 bytes
5 files changed, 164 insertions(+), 4 deletions(-)
diff --git a/output/2025/03/11/ordering-analysis/index.html b/output/2025/03/11/ordering-analysis/index.html
index f9760fa..c829a2c 100644
--- a/output/2025/03/11/ordering-analysis/index.html
+++ b/output/2025/03/11/ordering-analysis/index.html
@@ -146,7 +146,7 @@ Being logically streamable does not guarantee that a query will execute in a str
</table>
<p><br/></p>
<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
-<strong>How can a table have multiple orderings?:</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
+<strong>How can a table have multiple orderings?</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
For example consider the following query:
@@ -293,6 +293,46 @@ Following third and fourth constraints for the simplified table, the succinct va
<code>[amount ASC, price ASC]</code>, <br/>
<code>[time_bin ASC]</code>,<br/>
<code>[time ASC]</code> </p>
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
+<p><strong>How can DataFusion find orderings?</strong></p>
+DataFusion's <code>CREATE EXTERNAL TABLE</code> has a <code>WITH ORDER</code> clause (see <a href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table">docs</a>) to specify the known orderings of the table during table creation. For example, the following statement:<br/>
+<pre><code>
+CREATE EXTERNAL TABLE source (
+ amount INT NOT NULL,
+ price DOUBLE NOT NULL,
+ time TIMESTAMP NOT NULL,
+ ...
+)
+STORED AS CSV
+WITH ORDER (time ASC)
+WITH ORDER (amount ASC, price ASC)
+LOCATION '/path/to/FILE_NAME.csv'
+OPTIONS ('has_header' 'true');
+</code></pre>
+communicates that the <code>source</code> table has the orderings: <code>[time ASC]</code> and <code>[amount ASC, price ASC]</code>.<br/>
+When orderings are communicated from the source, DataFusion tracks the orderings through each operator while optimizing the plan. Depending on the operator, it may:<br/>
+<ul>
+<li>Add new orderings (such as when the "date_bin" function is applied to the "time" column)</li>
+<li>Remove orderings, if an operation doesn't preserve the ordering of the data at its input</li>
+<li>Update equivalent groups</li>
+<li>Update constant expressions</li>
+</ul>
+
+Figure 1 shows an example of how DataFusion generates an efficient plan for the query:
+<pre><code>
+SELECT
+ row_number() OVER (ORDER BY time) as rn,
+ time
+FROM events
+ORDER BY rn, time
+</code></pre>
+using the orderings of the query intermediates.<br/>
+<br/>
+<figure>
+<img alt="Window Query DataFusion Optimization" class="img-responsive" src="/blog/images/ordering_analysis/query_window_plan.png" width="80%"/>
+<figcaption><strong>Figure 1:</strong> DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans</figcaption>
+</figure>
+</blockquote>
<h3>Table Properties</h3>
<p>In summary, for the example table, the following properties correctly describe the sort properties:</p>
<ul>
diff --git a/output/feeds/all-en.atom.xml b/output/feeds/all-en.atom.xml
index 1a4efff..1c61f3b 100644
--- a/output/feeds/all-en.atom.xml
+++ b/output/feeds/all-en.atom.xml
@@ -124,7 +124,7 @@ Being logically streamable does not guarantee that a query will execute in a str
</table>
<p><br/></p>
<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
-<strong>How can a table have multiple orderings?:</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
+<strong>How can a table have multiple orderings?</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
For example consider the following query:
@@ -271,6 +271,46 @@ Following third and fourth constraints for the simplified table, the succinct va
<code>[amount ASC, price ASC]</code>, <br/>
<code>[time_bin ASC]</code>,<br/>
<code>[time ASC]</code> </p>
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
+<p><strong>How can DataFusion find orderings?</strong></p>
+DataFusion's <code>CREATE EXTERNAL TABLE</code> has a <code>WITH ORDER</code> clause (see <a href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table">docs</a>) to specify the known orderings of the table during table creation. For example, the following statement:<br/>
+<pre><code>
+CREATE EXTERNAL TABLE source (
+ amount INT NOT NULL,
+ price DOUBLE NOT NULL,
+ time TIMESTAMP NOT NULL,
+ ...
+)
+STORED AS CSV
+WITH ORDER (time ASC)
+WITH ORDER (amount ASC, price ASC)
+LOCATION '/path/to/FILE_NAME.csv'
+OPTIONS ('has_header' 'true');
+</code></pre>
+communicates that the <code>source</code> table has the orderings: <code>[time ASC]</code> and <code>[amount ASC, price ASC]</code>.<br/>
+When orderings are communicated from the source, DataFusion tracks the orderings through each operator while optimizing the plan. Depending on the operator, it may:<br/>
+<ul>
+<li>Add new orderings (such as when the "date_bin" function is applied to the "time" column)</li>
+<li>Remove orderings, if an operation doesn't preserve the ordering of the data at its input</li>
+<li>Update equivalent groups</li>
+<li>Update constant expressions</li>
+</ul>
+
+Figure 1 shows an example of how DataFusion generates an efficient plan for the query:
+<pre><code>
+SELECT
+ row_number() OVER (ORDER BY time) as rn,
+ time
+FROM events
+ORDER BY rn, time
+</code></pre>
+using the orderings of the query intermediates.<br/>
+<br/>
+<figure>
+<img alt="Window Query DataFusion Optimization" class="img-responsive" src="/blog/images/ordering_analysis/query_window_plan.png" width="80%"/>
+<figcaption><strong>Figure 1:</strong> DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans</figcaption>
+</figure>
+</blockquote>
<h3>Table Properties</h3>
<p>In summary, for the example table, the following properties correctly describe the sort properties:</p>
<ul>
diff --git a/output/feeds/blog.atom.xml b/output/feeds/blog.atom.xml
index a9119e1..67f1633 100644
--- a/output/feeds/blog.atom.xml
+++ b/output/feeds/blog.atom.xml
@@ -124,7 +124,7 @@ Being logically streamable does not guarantee that a query will execute in a str
</table>
<p><br/></p>
<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
-<strong>How can a table have multiple orderings?:</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
+<strong>How can a table have multiple orderings?</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
For example consider the following query:
@@ -271,6 +271,46 @@ Following third and fourth constraints for the simplified table, the succinct va
<code>[amount ASC, price ASC]</code>, <br/>
<code>[time_bin ASC]</code>,<br/>
<code>[time ASC]</code> </p>
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
+<p><strong>How can DataFusion find orderings?</strong></p>
+DataFusion's <code>CREATE EXTERNAL TABLE</code> has a <code>WITH ORDER</code> clause (see <a href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table">docs</a>) to specify the known orderings of the table during table creation. For example, the following statement:<br/>
+<pre><code>
+CREATE EXTERNAL TABLE source (
+ amount INT NOT NULL,
+ price DOUBLE NOT NULL,
+ time TIMESTAMP NOT NULL,
+ ...
+)
+STORED AS CSV
+WITH ORDER (time ASC)
+WITH ORDER (amount ASC, price ASC)
+LOCATION '/path/to/FILE_NAME.csv'
+OPTIONS ('has_header' 'true');
+</code></pre>
+communicates that the <code>source</code> table has the orderings: <code>[time ASC]</code> and <code>[amount ASC, price ASC]</code>.<br/>
+When orderings are communicated from the source, DataFusion tracks the orderings through each operator while optimizing the plan. Depending on the operator, it may:<br/>
+<ul>
+<li>Add new orderings (such as when the "date_bin" function is applied to the "time" column)</li>
+<li>Remove orderings, if an operation doesn't preserve the ordering of the data at its input</li>
+<li>Update equivalent groups</li>
+<li>Update constant expressions</li>
+</ul>
+
+Figure 1 shows an example of how DataFusion generates an efficient plan for the query:
+<pre><code>
+SELECT
+ row_number() OVER (ORDER BY time) as rn,
+ time
+FROM events
+ORDER BY rn, time
+</code></pre>
+using the orderings of the query intermediates.<br/>
+<br/>
+<figure>
+<img alt="Window Query DataFusion Optimization" class="img-responsive" src="/blog/images/ordering_analysis/query_window_plan.png" width="80%"/>
+<figcaption><strong>Figure 1:</strong> DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans</figcaption>
+</figure>
+</blockquote>
<h3>Table Properties</h3>
<p>In summary, for the example table, the following properties correctly describe the sort properties:</p>
<ul>
diff --git a/output/feeds/mustafa-akur-andrew-lamb.atom.xml b/output/feeds/mustafa-akur-andrew-lamb.atom.xml
index da0b115..d9ee35a 100644
--- a/output/feeds/mustafa-akur-andrew-lamb.atom.xml
+++ b/output/feeds/mustafa-akur-andrew-lamb.atom.xml
@@ -124,7 +124,7 @@ Being logically streamable does not guarantee that a query will execute in a str
</table>
<p><br/></p>
<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
-<strong>How can a table have multiple orderings?:</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
+<strong>How can a table have multiple orderings?</strong> At first glance it may seem counterintuitive for a table to have more than one valid ordering. However, during query execution such scenarios can arise.
For example consider the following query:
@@ -271,6 +271,46 @@ Following third and fourth constraints for the simplified table, the succinct va
<code>[amount ASC, price ASC]</code>, <br/>
<code>[time_bin ASC]</code>,<br/>
<code>[time ASC]</code> </p>
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;">
+<p><strong>How can DataFusion find orderings?</strong></p>
+DataFusion's <code>CREATE EXTERNAL TABLE</code> has a <code>WITH ORDER</code> clause (see <a href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table">docs</a>) to specify the known orderings of the table during table creation. For example, the following statement:<br/>
+<pre><code>
+CREATE EXTERNAL TABLE source (
+ amount INT NOT NULL,
+ price DOUBLE NOT NULL,
+ time TIMESTAMP NOT NULL,
+ ...
+)
+STORED AS CSV
+WITH ORDER (time ASC)
+WITH ORDER (amount ASC, price ASC)
+LOCATION '/path/to/FILE_NAME.csv'
+OPTIONS ('has_header' 'true');
+</code></pre>
+communicates that the <code>source</code> table has the orderings: <code>[time ASC]</code> and <code>[amount ASC, price ASC]</code>.<br/>
+When orderings are communicated from the source, DataFusion tracks the orderings through each operator while optimizing the plan. Depending on the operator, it may:<br/>
+<ul>
+<li>Add new orderings (such as when the "date_bin" function is applied to the "time" column)</li>
+<li>Remove orderings, if an operation doesn't preserve the ordering of the data at its input</li>
+<li>Update equivalent groups</li>
+<li>Update constant expressions</li>
+</ul>
+
+Figure 1 shows an example of how DataFusion generates an efficient plan for the query:
+<pre><code>
+SELECT
+ row_number() OVER (ORDER BY time) as rn,
+ time
+FROM events
+ORDER BY rn, time
+</code></pre>
+using the orderings of the query intermediates.<br/>
+<br/>
+<figure>
+<img alt="Window Query DataFusion Optimization" class="img-responsive" src="/blog/images/ordering_analysis/query_window_plan.png" width="80%"/>
+<figcaption><strong>Figure 1:</strong> DataFusion analyzes orderings of the sources and query intermediates to generate efficient plans</figcaption>
+</figure>
+</blockquote>
<h3>Table Properties</h3>
<p>In summary, for the example table, the following properties correctly describe the sort properties:</p>
<ul>
diff --git a/output/images/ordering_analysis/query_window_plan.png b/output/images/ordering_analysis/query_window_plan.png
new file mode 100644
index 0000000..ca30d22
Binary files /dev/null and b/output/images/ordering_analysis/query_window_plan.png differ