alamb commented on code in PR #9273:
URL: https://github.com/apache/arrow-datafusion/pull/9273#discussion_r1495347562


##########
datafusion/physical-expr/src/equivalence/properties.rs:
##########
@@ -1054,8 +1118,8 @@ impl DependencyNode {
     }
 }
 
-type DependencyMap = HashMap<PhysicalSortExpr, DependencyNode>;
-type Dependencies = HashSet<PhysicalSortExpr>;
+type DependencyMap = IndexMap<PhysicalSortExpr, DependencyNode>;

Review Comment:
   Maybe we can add a comment or two explaining *why* this uses an IndexMap and 
not a HashMap presumably because the order matters? Or it is required to get 
consistent plans?



##########
datafusion/physical-expr/src/equivalence/properties.rs:
##########
@@ -426,6 +426,71 @@ impl EquivalenceProperties {
         (!meet.is_empty()).then_some(meet)
     }
 
+    /// we substitute the ordering according to input expression type, this is 
a simplified version
+    /// In this case, we just substitute when the expression satisfy the 
following condition:
+    /// I. just have one column and is a CAST expression
+    /// TODO: Add one-to-ones analysis for monotonic ScalarFunctions.
+    /// TODO: we could precompute all the scenario that is computable, for 
example: atan(x + 1000) should also be substituted if
+    ///  x is DESC or ASC
+    /// After substitution, we may generate more than 1 `LexOrdering`. As an 
example,
+    /// `[a ASC, b ASC]` will turn into `[a ASC, b ASC], [CAST(a) ASC, b ASC]` 
when projection expressions `a, b, CAST(a)` is applied.
+    pub fn substitute_ordering_component(
+        &self,
+        mapping: &ProjectionMapping,
+        sort_expr: &[PhysicalSortExpr],
+    ) -> Result<Vec<Vec<PhysicalSortExpr>>> {
+        let new_orderings = sort_expr
+            .iter()
+            .map(|sort_expr| {
+                let referring_exprs: Vec<_> = mapping
+                    .iter()
+                    .map(|(source, _target)| source)
+                    .filter(|source| expr_refers(source, &sort_expr.expr))
+                    .cloned()
+                    .collect();
+                let mut res = vec![sort_expr.clone()];
+                // TODO: Add one-to-ones analysis for ScalarFunctions.
+                for r_expr in referring_exprs {
+                    // we check whether this expression is substitutable or not
+                    if let Some(cast_expr) = 
r_expr.as_any().downcast_ref::<CastExpr>() {
+                        // we need to know whether the Cast Expr matches or not
+                        let expr_type = 
sort_expr.expr.data_type(&self.schema)?;
+                        if cast_expr.expr.eq(&sort_expr.expr)

Review Comment:
   I wonder why this is checking for `cast` explicitly and not for 
`expr.is_monotonic`? It seems the logic holds as long as the expression is 
monotonic, and is more general than just casting



##########
datafusion/physical-expr/src/expressions/cast.rs:
##########
@@ -76,6 +76,26 @@ impl CastExpr {
     pub fn cast_options(&self) -> &CastOptions<'static> {
         &self.cast_options
     }
+    pub fn is_bigger_cast(&self, src: DataType) -> bool {

Review Comment:
   I think this is similar to what is in the coercion logic. Maybe we can 
consolidate it at some point



##########
datafusion/sqllogictest/test_files/monotonic_projection_test.slt:
##########
@@ -0,0 +1,133 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+
+#   http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Create a source where there is multiple orderings.
+statement ok
+CREATE EXTERNAL TABLE multiple_ordered_table (
+  a0 INTEGER,
+  a INTEGER,
+  b INTEGER,
+  c INTEGER,
+  d INTEGER
+)
+STORED AS CSV
+WITH HEADER ROW
+WITH ORDER (a ASC, b ASC)
+WITH ORDER (c ASC)
+LOCATION '../core/tests/data/window_2.csv';
+
+# test for substitute CAST scenario
+query TT
+EXPLAIN
+SELECT 
+    CAST(a AS BIGINT) AS a_big,
+    b
+FROM multiple_ordered_table
+ORDER BY a_big ASC, b ASC;
+----
+logical_plan
+Sort: a_big ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: CAST(multiple_ordered_table.a AS Int64) AS a_big, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a_big@0 ASC NULLS LAST,b@1 ASC NULLS LAST]
+--ProjectionExec: expr=[CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+query TT
+EXPLAIN
+SELECT a, CAST(a AS BIGINT) AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a ASC, b ASC;
+----
+logical_plan
+Sort: multiple_ordered_table.a ASC NULLS LAST, multiple_ordered_table.b ASC 
NULLS LAST
+--Projection: multiple_ordered_table.a, CAST(multiple_ordered_table.a AS 
Int64) AS a_big, multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a@0 ASC NULLS LAST,b@2 ASC NULLS LAST]
+--ProjectionExec: expr=[a@0 as a, CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+# TODO: Should generate without SortExec
+query TT
+EXPLAIN
+SELECT a, CAST(a AS BIGINT) AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a_big ASC, b ASC;
+----
+logical_plan
+Sort: a_big ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: multiple_ordered_table.a, CAST(multiple_ordered_table.a AS 
Int64) AS a_big, multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a_big@1 ASC NULLS LAST,b@2 ASC NULLS LAST]
+--ProjectionExec: expr=[a@0 as a, CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+# test for common rename
+query TT
+EXPLAIN
+SELECT a, a AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a_big ASC, b ASC;
+----
+logical_plan
+Sort: a_big ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: multiple_ordered_table.a, multiple_ordered_table.a AS a_big, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+ProjectionExec: expr=[a@0 as a, a@0 as a_big, b@1 as b]
+--CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+query TT
+EXPLAIN
+SELECT a, a AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a ASC, b ASC;
+----
+logical_plan
+Sort: multiple_ordered_table.a ASC NULLS LAST, multiple_ordered_table.b ASC 
NULLS LAST
+--Projection: multiple_ordered_table.a, multiple_ordered_table.a AS a_big, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+ProjectionExec: expr=[a@0 as a, a@0 as a_big, b@1 as b]
+--CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+
+# test for cast Utf8

Review Comment:
   ```suggestion
   # test for cast Utf8
   # (must actually sort as the sort order for a number cast to utf8 is 
different than for int)
   ```



##########
datafusion/sqllogictest/test_files/monotonic_projection_test.slt:
##########
@@ -0,0 +1,133 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+
+#   http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Create a source where there is multiple orderings.
+statement ok
+CREATE EXTERNAL TABLE multiple_ordered_table (
+  a0 INTEGER,
+  a INTEGER,
+  b INTEGER,
+  c INTEGER,
+  d INTEGER
+)
+STORED AS CSV
+WITH HEADER ROW
+WITH ORDER (a ASC, b ASC)
+WITH ORDER (c ASC)
+LOCATION '../core/tests/data/window_2.csv';
+
+# test for substitute CAST scenario
+query TT
+EXPLAIN
+SELECT 
+    CAST(a AS BIGINT) AS a_big,
+    b
+FROM multiple_ordered_table
+ORDER BY a_big ASC, b ASC;
+----
+logical_plan
+Sort: a_big ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: CAST(multiple_ordered_table.a AS Int64) AS a_big, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a_big@0 ASC NULLS LAST,b@1 ASC NULLS LAST]
+--ProjectionExec: expr=[CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+query TT
+EXPLAIN
+SELECT a, CAST(a AS BIGINT) AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a ASC, b ASC;
+----
+logical_plan
+Sort: multiple_ordered_table.a ASC NULLS LAST, multiple_ordered_table.b ASC 
NULLS LAST
+--Projection: multiple_ordered_table.a, CAST(multiple_ordered_table.a AS 
Int64) AS a_big, multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a@0 ASC NULLS LAST,b@2 ASC NULLS LAST]
+--ProjectionExec: expr=[a@0 as a, CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+# TODO: Should generate without SortExec
+query TT
+EXPLAIN
+SELECT a, CAST(a AS BIGINT) AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a_big ASC, b ASC;
+----
+logical_plan
+Sort: a_big ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: multiple_ordered_table.a, CAST(multiple_ordered_table.a AS 
Int64) AS a_big, multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a_big@1 ASC NULLS LAST,b@2 ASC NULLS LAST]
+--ProjectionExec: expr=[a@0 as a, CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+# test for common rename
+query TT
+EXPLAIN
+SELECT a, a AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a_big ASC, b ASC;
+----
+logical_plan
+Sort: a_big ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: multiple_ordered_table.a, multiple_ordered_table.a AS a_big, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+ProjectionExec: expr=[a@0 as a, a@0 as a_big, b@1 as b]
+--CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+query TT
+EXPLAIN
+SELECT a, a AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a ASC, b ASC;
+----
+logical_plan
+Sort: multiple_ordered_table.a ASC NULLS LAST, multiple_ordered_table.b ASC 
NULLS LAST
+--Projection: multiple_ordered_table.a, multiple_ordered_table.a AS a_big, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+ProjectionExec: expr=[a@0 as a, a@0 as a_big, b@1 as b]
+--CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+
+# test for cast Utf8
+query TT
+EXPLAIN
+SELECT 
+    CAST(a AS STRING) AS a_str,
+    b
+FROM multiple_ordered_table
+ORDER BY a_str ASC, b ASC;
+----
+logical_plan
+Sort: a_str ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: CAST(multiple_ordered_table.a AS Utf8) AS a_str, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a_str@0 ASC NULLS LAST,b@1 ASC NULLS LAST]
+--SortExec: expr=[a_str@0 ASC NULLS LAST,b@1 ASC NULLS LAST]
+----ProjectionExec: expr=[CAST(a@0 AS Utf8) as a_str, b@1 as b]
+------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+--------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true

Review Comment:
   I suggest two more negative tests:
   *  `ORDER BY a + b`  
   * `ORDER BY CAST(a + b AS BIGINT)`
   
   To show that `+` if a column is not order preserving 



##########
datafusion/sqllogictest/test_files/monotonic_projection_test.slt:
##########
@@ -0,0 +1,133 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+
+#   http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Create a source where there is multiple orderings.
+statement ok
+CREATE EXTERNAL TABLE multiple_ordered_table (
+  a0 INTEGER,
+  a INTEGER,
+  b INTEGER,
+  c INTEGER,
+  d INTEGER
+)
+STORED AS CSV
+WITH HEADER ROW
+WITH ORDER (a ASC, b ASC)
+WITH ORDER (c ASC)
+LOCATION '../core/tests/data/window_2.csv';
+
+# test for substitute CAST scenario
+query TT
+EXPLAIN
+SELECT 
+    CAST(a AS BIGINT) AS a_big,
+    b
+FROM multiple_ordered_table
+ORDER BY a_big ASC, b ASC;
+----
+logical_plan
+Sort: a_big ASC NULLS LAST, multiple_ordered_table.b ASC NULLS LAST
+--Projection: CAST(multiple_ordered_table.a AS Int64) AS a_big, 
multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a_big@0 ASC NULLS LAST,b@1 ASC NULLS LAST]
+--ProjectionExec: expr=[CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+query TT
+EXPLAIN
+SELECT a, CAST(a AS BIGINT) AS a_big, b
+FROM multiple_ordered_table
+ORDER BY a ASC, b ASC;
+----
+logical_plan
+Sort: multiple_ordered_table.a ASC NULLS LAST, multiple_ordered_table.b ASC 
NULLS LAST
+--Projection: multiple_ordered_table.a, CAST(multiple_ordered_table.a AS 
Int64) AS a_big, multiple_ordered_table.b
+----TableScan: multiple_ordered_table projection=[a, b]
+physical_plan
+SortPreservingMergeExec: [a@0 ASC NULLS LAST,b@2 ASC NULLS LAST]
+--ProjectionExec: expr=[a@0 as a, CAST(a@0 AS Int64) as a_big, b@1 as b]
+----RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------CsvExec: file_groups={1 group: 
[[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a, b], 
output_ordering=[a@0 ASC NULLS LAST, b@1 ASC NULLS LAST], has_header=true
+
+# TODO: Should generate without SortExec

Review Comment:
   The plan doesn't seem to have a `SortExec` to me -- maybe the comment is out 
of date 🤔 



##########
datafusion/sqllogictest/test_files/monotonic_projection_test.slt:
##########
@@ -0,0 +1,133 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+
+#   http://www.apache.org/licenses/LICENSE-2.0
+
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# Create a source where there is multiple orderings.
+statement ok
+CREATE EXTERNAL TABLE multiple_ordered_table (
+  a0 INTEGER,
+  a INTEGER,
+  b INTEGER,
+  c INTEGER,
+  d INTEGER
+)
+STORED AS CSV
+WITH HEADER ROW
+WITH ORDER (a ASC, b ASC)

Review Comment:
   Neat



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to