alamb commented on code in PR #59:
URL: https://github.com/apache/datafusion-site/pull/59#discussion_r1998872033


##########
content/blog/2025-03-11-ordering-analysis.md:
##########
@@ -291,6 +291,31 @@ Following third and fourth constraints for the simplified 
table, the succinct va
 `[time_bin ASC]`,  
 `[time ASC]`  
 
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; 
background-color: #f8f9fa;">
+<strong>How DataFusion discovers orderings?</strong> Datafusion has <code>WITH 
ORDER</code> clause (see <a 
href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table";>docs</a>)
 to specify the known orderings of the table during table creation. For 
example, following query:<br>

Review Comment:
   ```suggestion
   <strong>How can DataFusion find orderings?</strong> DataFusion's `CREATE 
EXTERNAL TABLE` has a <code>WITH ORDER</code> clause (see <a 
href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table";>docs</a>)
 to specify the known orderings of the table during table creation. For 
example, following query:<br>
   ```
   
   



##########
content/blog/2025-03-11-ordering-analysis.md:
##########
@@ -291,6 +291,31 @@ Following third and fourth constraints for the simplified 
table, the succinct va
 `[time_bin ASC]`,  
 `[time ASC]`  
 
+<blockquote style="border-left: 4px solid #007bff; padding: 10px; 
background-color: #f8f9fa;">
+<strong>How DataFusion discovers orderings?</strong> Datafusion has <code>WITH 
ORDER</code> clause (see <a 
href="https://datafusion.apache.org/user-guide/sql/ddl.html#create-external-table";>docs</a>)
 to specify the known orderings of the table during table creation. For 
example, following query:<br>
+<pre><code>
+CREATE EXTERNAL TABLE source (
+    amount INT NOT NULL,
+    price DOUBLE NOT NULL,
+    time TIMESTAMP NOT NULL,
+    ...
+)
+STORED AS CSV
+WITH ORDER (time ASC)
+WITH ORDER (amount ASC, price ASC)
+LOCATION '/path/to/FILE_NAME.csv'
+OPTIONS ('has_header' 'true');
+</code></pre>
+communicates that "source" table has the orderings: <code>[time ASC]</code> 
and <code>[amount ASC, price ASC]</code>.<br>
+When orderings are properly communicated from the source, DataFusion can track 
the transformations at each operator. According to transformations happening in 
each operator DataFusion can<br>
+<ul>
+<li>add new orderings (such as when "date_bin" function is applied to the 
"time" column)</li>
+<li>Remove orderings, if operation doesn't preserve the ordering of the data 
at its input</li>
+<li>Update equivalent groups</li>
+<li>Update constant expressions</li>
+</ul>
+</blockquote>

Review Comment:
   I think it might be helpful to explicitly point out that DataFusion can find 
/ use orderings based on query intermediates even if the original source is not 
ordered
   
   For example
   ```
   SELECT 
     row_number() OVER (ORDER BY time) as rn,
     time
   FROM events
   ORDER BY rn, time
   ```
   
   Knows that it does not have to reorder by time
   
   ```sql
   > create table events (time timestamp) as values ('2000-01-01');
   0 row(s) fetched.
   Elapsed 0.041 seconds.
   
   > select * from events;
   +---------------------+
   | time                |
   +---------------------+
   | 2000-01-01T00:00:00 |
   +---------------------+
   1 row(s) fetched.
   Elapsed 0.004 seconds.
   
   > explain format tree SELECT
     row_number() OVER (ORDER BY time) as rn,
     time
   FROM events
   ORDER BY rn, time;
   +---------------+-------------------------------+
   | plan_type     | plan                          |
   +---------------+-------------------------------+
   | physical_plan | ┌───────────────────────────┐ |
   |               | │       ProjectionExec      │ |
   |               | │    --------------------   │ |
   |               | │            rn:            │ |
   |               | │   row_number() ORDER BY   │ |
   |               | │      [events.time ASC     │ |
   |               | │      NULLS LAST] RANGE    │ |
   |               | │      BETWEEN UNBOUNDED    │ |
   |               | │        PRECEDING AND      │ |
   |               | │         CURRENT ROW       │ |
   |               | │                           │ |
   |               | │         time: time        │ |
   |               | └─────────────┬─────────────┘ |
   |               | ┌─────────────┴─────────────┐ |
   |               | │    BoundedWindowAggExec   │ |
   |               | │    --------------------   │ |
   |               | │        mode: Sorted       │ |
   |               | │                           │ |
   |               | │        select_list:       │ |
   |               | │   row_number() ORDER BY   │ |
   |               | │      [events.time ASC     │ |
   |               | │      NULLS LAST] RANGE    │ |
   |               | │      BETWEEN UNBOUNDED    │ |
   |               | │        PRECEDING AND      │ |
   |               | │         CURRENT ROW       │ |
   |               | └─────────────┬─────────────┘ |
   |               | ┌─────────────┴─────────────┐ |
   |               | │          SortExec         │ |
   |               | │    --------------------   │ |
   |               | │   time@0 ASC NULLS LAST   │ |
   |               | └─────────────┬─────────────┘ |
   |               | ┌─────────────┴─────────────┐ |
   |               | │       DataSourceExec      │ |
   |               | │    --------------------   │ |
   |               | │         bytes: 128        │ |
   |               | │       format: memory      │ |
   |               | │          rows: 1          │ |
   |               | └───────────────────────────┘ |
   |               |                               |
   +---------------+-------------------------------+
   1 row(s) fetched.
   Elapsed 0.003 seconds.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to