alamb commented on code in PR #59: URL: https://github.com/apache/datafusion-site/pull/59#discussion_r1998872033
########## content/blog/2025-03-11-ordering-analysis.md: ########## @@ -291,6 +291,31 @@ Following third and fourth constraints for the simplified table, the succinct va `[time_bin ASC]`, `[time ASC]` +<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;"> +<strong>How DataFusion discovers orderings?</strong> Datafusion has <code>WITH ORDER</code> clause (see <a href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table">docs</a>) to specify the known orderings of the table during table creation. For example, following query:<br> Review Comment: ```suggestion <strong>How can DataFusion find orderings?</strong> DataFusion's `CREATE EXTERNAL TABLE` has a <code>WITH ORDER</code> clause (see <a href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table">docs</a>) to specify the known orderings of the table during table creation. For example, following query:<br> ``` ########## content/blog/2025-03-11-ordering-analysis.md: ########## @@ -291,6 +291,31 @@ Following third and fourth constraints for the simplified table, the succinct va `[time_bin ASC]`, `[time ASC]` +<blockquote style="border-left: 4px solid #007bff; padding: 10px; background-color: #f8f9fa;"> +<strong>How DataFusion discovers orderings?</strong> Datafusion has <code>WITH ORDER</code> clause (see <a href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table">docs</a>) to specify the known orderings of the table during table creation. For example, following query:<br> +<pre><code> +CREATE EXTERNAL TABLE source ( + amount INT NOT NULL, + price DOUBLE NOT NULL, + time TIMESTAMP NOT NULL, + ... +) +STORED AS CSV +WITH ORDER (time ASC) +WITH ORDER (amount ASC, price ASC) +LOCATION '/path/to/FILE_NAME.csv' +OPTIONS ('has_header' 'true'); +</code></pre> +communicates that "source" table has the orderings: <code>[time ASC]</code> and <code>[amount ASC, price ASC]</code>.<br> +When orderings are properly communicated from the source, DataFusion can track the transformations at each operator. According to transformations happening in each operator DataFusion can<br> +<ul> +<li>add new orderings (such as when "date_bin" function is applied to the "time" column)</li> +<li>Remove orderings, if operation doesn't preserve the ordering of the data at its input</li> +<li>Update equivalent groups</li> +<li>Update constant expressions</li> +</ul> +</blockquote> Review Comment: I think it might be helpful to explicitly point out that DataFusion can find / use orderings based on query intermediates even if the original source is not ordered For example ``` SELECT row_number() OVER (ORDER BY time) as rn, time FROM events ORDER BY rn, time ``` Knows that it does not have to reorder by time ```sql > create table events (time timestamp) as values ('2000-01-01'); 0 row(s) fetched. Elapsed 0.041 seconds. > select * from events; +---------------------+ | time | +---------------------+ | 2000-01-01T00:00:00 | +---------------------+ 1 row(s) fetched. Elapsed 0.004 seconds. > explain format tree SELECT row_number() OVER (ORDER BY time) as rn, time FROM events ORDER BY rn, time; +---------------+-------------------------------+ | plan_type | plan | +---------------+-------------------------------+ | physical_plan | ┌───────────────────────────┐ | | | │ ProjectionExec │ | | | │ -------------------- │ | | | │ rn: │ | | | │ row_number() ORDER BY │ | | | │ [events.time ASC │ | | | │ NULLS LAST] RANGE │ | | | │ BETWEEN UNBOUNDED │ | | | │ PRECEDING AND │ | | | │ CURRENT ROW │ | | | │ │ | | | │ time: time │ | | | └─────────────┬─────────────┘ | | | ┌─────────────┴─────────────┐ | | | │ BoundedWindowAggExec │ | | | │ -------------------- │ | | | │ mode: Sorted │ | | | │ │ | | | │ select_list: │ | | | │ row_number() ORDER BY │ | | | │ [events.time ASC │ | | | │ NULLS LAST] RANGE │ | | | │ BETWEEN UNBOUNDED │ | | | │ PRECEDING AND │ | | | │ CURRENT ROW │ | | | └─────────────┬─────────────┘ | | | ┌─────────────┴─────────────┐ | | | │ SortExec │ | | | │ -------------------- │ | | | │ time@0 ASC NULLS LAST │ | | | └─────────────┬─────────────┘ | | | ┌─────────────┴─────────────┐ | | | │ DataSourceExec │ | | | │ -------------------- │ | | | │ bytes: 128 │ | | | │ format: memory │ | | | │ rows: 1 │ | | | └───────────────────────────┘ | | | | +---------------+-------------------------------+ 1 row(s) fetched. Elapsed 0.003 seconds. ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org