Re: [PR] feat: eliminate redundant sorts on monotonic expressions [arrow-datafusion]

via GitHub Wed, 10 Apr 2024 11:19:11 -0700


alamb commented on code in PR #9813:
URL: https://github.com/apache/arrow-datafusion/pull/9813#discussion_r1559856903



##########
datafusion/physical-expr/src/equivalence/mod.rs:
##########
@@ -45,6 +46,46 @@ pub fn collapse_lex_req(input: LexRequirement) -> 
LexRequirement {
     output
 }
 
+/// This function constructs a normalized [`LexRequirement`] by filtering out 
entries
+/// that are ordered if the next entry is.

Review Comment:
   
   
   It seems like there are no additional callsites of `collapse_lex_ordering` 
-- perhaps we should simply update `collapse_lex_ordering` instead of making a 
new function 🤔 
   
   
https://github.com/search?q=repo%3Aapache%2Farrow-datafusion+collapse_lex_ordering&type=code



##########
datafusion/sqllogictest/test_files/filter_without_sort_exec.slt:
##########
@@ -43,19 +43,36 @@ SortPreservingMergeExec: [date@0 ASC NULLS LAST,time@2 ASC 
NULLS LAST]
 ------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
 --------StreamingTableExec: partition_sizes=1, projection=[date, ticker, 
time], infinite_source=true, output_ordering=[date@0 ASC NULLS LAST, ticker@1 
ASC NULLS LAST, time@2 ASC NULLS LAST]
 
+# query
+query TT

Review Comment:
   Could you also test some similar queries that should also have their sorts 
removed? LIke
   
   ```sql
   SELECT * FROM data
   WHERE ticker = 'A' AND CAST(time AS DATE) = date
   ORDER BY "date"
   ```
   
   ```sql
   SELECT * FROM data
   WHERE ticker = 'A' AND CAST(time AS DATE) = date
   ORDER BY "ticker"
   ```
   
   As well as some cases where it should not:
   
   
   ```sql
   SELECT * FROM data
   WHERE ticker = 'A' AND CAST(time AS DATE) <> date
   ORDER BY "timestamp"
   ```
   



##########
datafusion/physical-expr/src/equivalence/properties.rs:
##########
@@ -2212,6 +2212,97 @@ mod tests {
             );
         }
 
+        Ok(())
+    }
+    #[test]
+    fn test_eliminate_redundant_monotonic_sorts() -> Result<()> {
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("a", DataType::Date32, true),
+            Field::new("b", DataType::Utf8, true),
+            Field::new("c", DataType::Timestamp(TimeUnit::Nanosecond, None), 
true),
+        ]));
+        let base_properties = 
EquivalenceProperties::new(schema.clone()).with_reorder(
+            ["a", "b", "c"]
+                .into_iter()
+                .map(|c| {
+                    col(c, schema.as_ref()).map(|expr| PhysicalSortExpr {
+                        expr,
+                        options: SortOptions {
+                            descending: false,
+                            nulls_first: true,
+                        },
+                    })
+                })
+                .collect::<Result<Vec<_>>>()?,
+        );
+
+        struct TestCase {
+            name: &'static str,
+            constants: Vec<Arc<dyn PhysicalExpr>>,
+            equal_conditions: Vec<[Arc<dyn PhysicalExpr>; 2]>,
+            sort_columns: &'static [&'static str],
+            should_satisfy_ordering: bool,
+        }
+
+        let cases = vec![

Review Comment:
   I would find these cases easier to read if they were written like
   
   ```rust
   let col_a = col("a", schema.as_ref())?;
   let col_b = col("b", schema.as_ref())?;
   let cast_c =  Arc::new(CastExpr::new(
                           col("c", schema.as_ref())?,
                           DataType::Date32,
                           None,
                       ));
   ...
   
               TestCase {
                   name: "(date, ticker, time) -> (time)",
                   // b is constant, so it should be removed from the sort order
                   constants: vec![col_b.clone()?],
                   equal_conditions: vec![[cast_c.clone(), col_a.clone()]],
                   sort_columns: &["c"],
                   should_satisfy_ordering: true,
   ...
   ```



##########
datafusion/physical-expr/src/equivalence/properties.rs:
##########
@@ -2212,6 +2212,97 @@ mod tests {
             );
         }
 
+        Ok(())
+    }
+    #[test]
+    fn test_eliminate_redundant_monotonic_sorts() -> Result<()> {
+        let schema = Arc::new(Schema::new(vec![
+            Field::new("a", DataType::Date32, true),

Review Comment:
   Would it be possible to use the same names in the schema `(a, b, c)` that 
are used in the test description `(date, ticker, time)`? I found it hard to 
mentally map between the two while reviewing this code



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: eliminate redundant sorts on monotonic expressions [arrow-datafusion]

Reply via email to