alamb commented on code in PR #58:
URL: https://github.com/apache/datafusion-site/pull/58#discussion_r1983788745
##########
content/blog/2025-03-05-ordering-analysis.md:
##########
@@ -0,0 +1,176 @@
+---
+layout: post
+title: Analysis of Ordering for Better Plans
+date: 2025-03-05
+author: Mustafa Akur
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+
+In this blog post, we will explore how to determine whether an ordering
requirement[^1] of an operator is satisfied by its input data. This analysis is
essential for order-based optimizations and is often more complex than one
might initially think.
Review Comment:
Another point I don't see mentioned here is the connection between
sortedness analysis and "streamability" which I think is quite a key insight
and not one I was aware of before working with you and @ozankabak in
DataFusion. It is a very powerful concept I think
##########
content/blog/2025-03-05-ordering-analysis.md:
##########
@@ -0,0 +1,176 @@
+---
+layout: post
+title: Analysis of Ordering for Better Plans
+date: 2025-03-05
+author: Mustafa Akur
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+
+In this blog post, we will explore how to determine whether an ordering
requirement[^1] of an operator is satisfied by its input data. This analysis is
essential for order-based optimizations and is often more complex than one
might initially think.
+
+Some operators in the physical query plan require their input data to be
ordered. The reasons for these requirements include:
+- **More efficient implementation** (e.g., `SortMergeJoin`)
+- **Low memory footprint** (e.g., `SortMergeJoin`, `Aggregation`, `Windowing`)
+- **Hard requirements from the query itself** (e.g., `Window` operator with
`ORDER BY` clause)
+
+If an operator's ordering requirement is not satisfied by the input, a `Sort`
operator (e.g., `SortExec` in the `DataFusion` framework) may be inserted to
fulfill the requirement. If the requirement is already met, the plan can
proceed without an additional `Sort` operator.
+
+Correctly analyzing whether the ordering requirement is satisfied is critical.
Failure to do so can lead to:
+- **Incorrect query results** if a `Sort` operator is omitted.
+- **Inefficient execution plans** if a `Sort` operator is unnecessarily added.
+
+This post outlines the necessary properties of the table we must monitor for
this analysis and illustrates the analysis process.
+
+
+### Example Virtual Table
+
+Let's define a virtual table to model the input data of an operator for
analysis:
+
+| a1 | a2 | c1 | c2 | b1 | b2 | a2_clone | b2_clone |
+|----|----|----|----|----|----|----------|----------|
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
+| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
+| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
+| 1 | 1 | 0 | 1 | 0 | 2 | 1 | 2 |
+| 1 | 2 | 0 | 1 | 1 | 0 | 2 | 0 |
+| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
+| 2 | 1 | 0 | 1 | 1 | 2 | 1 | 2 |
+
+## Key Concepts for Analyzing Orderings
Review Comment:
It might also be point people to the implementation for more information --
something like "These concepts are implemented in the EquivalenceProperties
structure in DataFusion, please see the source (link) for more details"
You do mention this in the conclusion but I suggest bringing it earlier in
the text so that it is clearer that this blog post is describing something
implemented in DataFusion
> The `DataFusion` query engine uses this kind of analysis during its
planning stage to ensure correct and efficient query execution. You can find
the implementation of this analysis in the [DataFusion
repository](https://github.com/apache/datafusion/tree/main/datafusion/physical-expr/src/equivalence).
##########
content/blog/2025-03-05-ordering-analysis.md:
##########
@@ -0,0 +1,176 @@
+---
+layout: post
+title: Analysis of Ordering for Better Plans
+date: 2025-03-05
+author: Mustafa Akur
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+
+In this blog post, we will explore how to determine whether an ordering
requirement[^1] of an operator is satisfied by its input data. This analysis is
essential for order-based optimizations and is often more complex than one
might initially think.
+
+Some operators in the physical query plan require their input data to be
ordered. The reasons for these requirements include:
Review Comment:
I think it would really help to provide a motivating example here, and start
simple
Perhaps a motivating example could be a query like this to get the most
recent data
```sql
SELECT hostname, log_line ORDER BY time DESC limit 10
```
And note that if the input is already sorted by descending timestamp, it can
be implemented super efficiently.
The key is that the query optimizer needs to know the data is already
sorted. For simple queries that is likely simple, but it gets complicated fast,
like for example, what if your data is sorted by `hostname, time DESC` and your
query is
```sql
SELECT hostname, log_line WHERE hostname = 'app.example.com' ORDER BY time
DESC;
```
In this case, the system still doesn't have to do any sorting -- but only if
it has enough analysis to be able to reason about the sortedness of the stream
when we know `hostname ` has a single value
(I can propose this directly to the post if you prefer, but I wanted to
write it as a comment / run it by you first)
----
Then the next section could lay out the other places where sorting is used
to improve performance in DataFusion (such as GroupBy without having to hash
the entire data, etc)
A more complex example might be:
```sql
SELECT sum(sales)
FROM lineitem l JOIN orders o ON (l.l_key = o.l_key)
```
And point out that if the data is already softed on `l_key` for both tables,
a much more efficient algorthm can be used)
##########
content/blog/2025-03-05-ordering-analysis.md:
##########
@@ -0,0 +1,176 @@
+---
+layout: post
+title: Analysis of Ordering for Better Plans
+date: 2025-03-05
+author: Mustafa Akur
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+
+In this blog post, we will explore how to determine whether an ordering
requirement[^1] of an operator is satisfied by its input data. This analysis is
essential for order-based optimizations and is often more complex than one
might initially think.
+
+Some operators in the physical query plan require their input data to be
ordered. The reasons for these requirements include:
+- **More efficient implementation** (e.g., `SortMergeJoin`)
+- **Low memory footprint** (e.g., `SortMergeJoin`, `Aggregation`, `Windowing`)
+- **Hard requirements from the query itself** (e.g., `Window` operator with
`ORDER BY` clause)
+
+If an operator's ordering requirement is not satisfied by the input, a `Sort`
operator (e.g., `SortExec` in the `DataFusion` framework) may be inserted to
fulfill the requirement. If the requirement is already met, the plan can
proceed without an additional `Sort` operator.
+
+Correctly analyzing whether the ordering requirement is satisfied is critical.
Failure to do so can lead to:
+- **Incorrect query results** if a `Sort` operator is omitted.
+- **Inefficient execution plans** if a `Sort` operator is unnecessarily added.
+
+This post outlines the necessary properties of the table we must monitor for
this analysis and illustrates the analysis process.
+
+
+### Example Virtual Table
+
+Let's define a virtual table to model the input data of an operator for
analysis:
+
+| a1 | a2 | c1 | c2 | b1 | b2 | a2_clone | b2_clone |
Review Comment:
I think this example would be easier to follow if it used more unique names
so that readers didn't have remember `a1` vs `c1` and `b2`, etc
Maybe we could use `hostname`, `time`, etc to follow the examples above. Or
maybe `time`, `symbol`, `stock_price`, etc
##########
content/blog/2025-03-05-ordering-analysis.md:
##########
@@ -0,0 +1,176 @@
+---
+layout: post
+title: Analysis of Ordering for Better Plans
+date: 2025-03-05
+author: Mustafa Akur
+categories: [tutorial]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+<!-- see https://github.com/apache/datafusion/issues/11631 for details -->
+
+## Introduction
+
+In this blog post, we will explore how to determine whether an ordering
requirement[^1] of an operator is satisfied by its input data. This analysis is
essential for order-based optimizations and is often more complex than one
might initially think.
+
+Some operators in the physical query plan require their input data to be
ordered. The reasons for these requirements include:
+- **More efficient implementation** (e.g., `SortMergeJoin`)
+- **Low memory footprint** (e.g., `SortMergeJoin`, `Aggregation`, `Windowing`)
+- **Hard requirements from the query itself** (e.g., `Window` operator with
`ORDER BY` clause)
+
+If an operator's ordering requirement is not satisfied by the input, a `Sort`
operator (e.g., `SortExec` in the `DataFusion` framework) may be inserted to
fulfill the requirement. If the requirement is already met, the plan can
proceed without an additional `Sort` operator.
+
+Correctly analyzing whether the ordering requirement is satisfied is critical.
Failure to do so can lead to:
+- **Incorrect query results** if a `Sort` operator is omitted.
+- **Inefficient execution plans** if a `Sort` operator is unnecessarily added.
+
+This post outlines the necessary properties of the table we must monitor for
this analysis and illustrates the analysis process.
+
+
+### Example Virtual Table
+
+Let's define a virtual table to model the input data of an operator for
analysis:
+
+| a1 | a2 | c1 | c2 | b1 | b2 | a2_clone | b2_clone |
+|----|----|----|----|----|----|----------|----------|
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
+| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
+| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
+| 1 | 1 | 0 | 1 | 0 | 2 | 1 | 2 |
+| 1 | 2 | 0 | 1 | 1 | 0 | 2 | 0 |
+| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
+| 2 | 1 | 0 | 1 | 1 | 2 | 1 | 2 |
+
+## Key Concepts for Analyzing Orderings
+Before analyzing whether an ordering requirement is satisfied, we first need
to define a few properties of the input data. These properties help guide the
analysis process:
+
+### 1. Constant Expressions
+Constant expressions are those where each row in the expression has the same
value across all rows. Although constant expressions may seem odd in a table,
they can arise after operations like `Filter` or `Join`. In our example table,
expressions `c1` and `c2` are constant.
+
+For instance:
+- Expression `c1` and `c2` are constant because every row in the table has the
same value for these columns.
+
+### 2. Equivalent Expression Groups
+Equivalent expression groups are expressions that always hold the same value
across rows. These expressions can be thought of as clones of each other and
may arise from operations like `Filter`, `Join`, or `Projection`.
+
+For example, in our table, the expressions `a2` and `a2_clone` form one
equivalence group, and `b2` and `b2_clone` form another equivalence group.
+
+### 3. Valid Orderings
+Valid orderings are the orderings that the table already satisfies. However,
there are many possible options for valid ordering. Some of the valid orderings
for the table is as follows:
+`[a1 ASC, a2 ASC]`,
+`[a1 ASC]`,
+`[a1 ASC, a2_clone ASC]`,
+`[a1 ASC, a2 ASC, c1 ASC]`,
+`[a1 ASC, a2 ASC, c1 DESC]`,
+`[a1 ASC, c1 ASC, a2 ASC]`,
+`[a1 ASC, c1 DESC, a2 ASC]`,
+.
+.
+.
+As can be seen from the above valid orderings. Storing all of the valid
orderings is wasteful, and contains lots of redundancy. Some of the problems
are:
+
+- Storing prefix of another valid ordering is redundant. If the table
satisfies lexicographical ordering[^2]: `[a1 ASC, a2 ASC]`, it already
satisfies ordering `[a1 ASC]` trivially. Hence, once we store `[a1 ASC, a2
ASC]` we do not need to store `[a1 ASC]` seperately.
+
+- Using all entries in an Equivalent Expression Group is redundant. If we know
that ordering `[a1 ASC, a2 ASC]` is satisfied by the table, table also
satisfies `[a1 ASC, a2_clone ASC]` since `a2` and `a2_clone` are copy of each
other. Hence, it is enough to use just one expresssion (let's say first
expression) in an Equivalent Expression Group during the construction of the
the valid orderings.
+
+- Constant expressions can be inserted into any place inside valid ordering
with an arbitrary direction (`ASC`, `DESC`). Hence, If ordering `[a1 ASC, a2
ASC]` is valid, orderings: `[c1 ASC, a1 ASC, a2 ASC]`, `[c1 DESC, a1 ASC, a2
ASC]`, `[a1 ASC, c1 ASC, a2 ASC]`, .. are all also valid. This is clearly
redundant. For this reason, it is better to not use any constant expression
during existing ordering construction.
+
+In summary,
+
+- We should only use the longest lexicographical ordering as a valid ordering
(shouldn't use any prefix of it)
+- Ordering should contain only one expression from an equivalence group (a
representative of the group).
+- Existing ordering expressions shouldn't contain any constant expression.
+
+Adhering to these principles, valid orderings are `[a1 ASC, a2 ASC]`, `[b1
ASC, b2 ASC]` in our table.
+
+### Table Properties
+
+| a1 | a2 | c1 | c2 | b1 | b2 | a2_clone | b2_clone |
+|----|----|----|----|----|----|----------|----------|
+| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
+| 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
+| 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
+| 1 | 1 | 0 | 1 | 0 | 2 | 1 | 2 |
+| 1 | 2 | 0 | 1 | 1 | 0 | 2 | 0 |
+| 2 | 0 | 0 | 1 | 1 | 1 | 0 | 1 |
+| 2 | 1 | 0 | 1 | 1 | 2 | 1 | 2 |
+
+For the table above, the following properties can be derived:
+- **Constant Expressions** = `[c1, c2]`
+- **Equivalent Expression Groups** = `[[a2, a2_clone], [b2, b2_clone]]`
+- **Valid Orderings** = `[[a1 ASC, a2 ASC], [b1 ASC, b2 ASC]]`
+
+### Algorithm for Analyzing Ordering Requirements
+
+We can use following algorithm to check whether an ordering requirement is
satisfied by a table:
Review Comment:
I think it would be helpful to tie "ordering requirement" back to what it is
being used for
So maybe something like
```suggestion
We can use following algorithm to check whether an more efficient operation
that requires a certain ordering is satisfied by an existing table:
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]