Re: [PR] test: interval analysis unit tests [datafusion]

via GitHub Mon, 20 Jan 2025 09:55:55 -0800


berkaysynnada commented on code in PR #14189:
URL: https://github.com/apache/datafusion/pull/14189#discussion_r1922727191



##########
datafusion/physical-expr/src/analysis.rs:
##########
@@ -246,3 +246,124 @@ fn calculate_selectivity(
             acc * cardinality_ratio(&initial.interval, &target.interval)
         })
 }
+
+#[cfg(test)]
+mod tests {
+    use std::sync::Arc;
+
+    use arrow_schema::{DataType, Field, Schema};
+    use datafusion_common::{assert_contains, DFSchema};
+    use datafusion_expr::{
+        col, execution_props::ExecutionProps, interval_arithmetic::Interval, 
lit, Expr,
+    };
+
+    use crate::{create_physical_expr, AnalysisContext};
+
+    use super::{analyze, ExprBoundaries};
+
+    fn make_field(name: &str, data_type: DataType) -> Field {
+        let nullable = false;
+        Field::new(name, data_type, nullable)
+    }
+
+    #[test]
+    fn test_analyze_boundary_exprs() {
+        let schema = Arc::new(Schema::new(vec![make_field("a", 
DataType::Int32)]));
+
+        /// Test case containing (expression tree, lower bound, upper bound)
+        type TestCase = (Expr, Option<i32>, Option<i32>);
+
+        let test_cases: Vec<TestCase> = vec![
+            // a > 10
+            (col("a").gt(lit(10)), Some(11), None),
+            // a < 20
+            (col("a").lt(lit(20)), None, Some(19)),
+            // a > 10 AND a < 20
+            (
+                col("a").gt(lit(10)).and(col("a").lt(lit(20))),
+                Some(11),
+                Some(19),
+            ),
+            // a >= 10
+            (col("a").gt_eq(lit(10)), Some(10), None),
+            // a <= 20
+            (col("a").lt_eq(lit(20)), None, Some(20)),
+            // a >= 10 AND a <= 20
+            (
+                col("a").gt_eq(lit(10)).and(col("a").lt_eq(lit(20))),
+                Some(10),
+                Some(20),
+            ),
+            // a > 10 AND a < 20 AND a < 15
+            (
+                col("a")
+                    .gt(lit(10))
+                    .and(col("a").lt(lit(20)))
+                    .and(col("a").lt(lit(15))),
+                Some(11),
+                Some(14),
+            ),
+            // (a > 10 AND a < 20) AND (a > 15 AND a < 25)
+            (
+                col("a")
+                    .gt(lit(10))
+                    .and(col("a").lt(lit(20)))
+                    .and(col("a").gt(lit(15)))
+                    .and(col("a").lt(lit(25))),
+                Some(16),
+                Some(19),
+            ),
+            // (a > 10 AND a < 20) AND (a > 20 AND a < 30)
+            (
+                col("a")
+                    .gt(lit(10))
+                    .and(col("a").lt(lit(20)))
+                    .and(col("a").gt(lit(20)))
+                    .and(col("a").lt(lit(30))),
+                None,
+                None,
+            ),
+        ];
+        for (expr, lower, upper) in test_cases {
+            let boundaries = 
ExprBoundaries::try_new_unbounded(&schema).unwrap();
+            let df_schema = DFSchema::try_from(Arc::clone(&schema)).unwrap();
+            let physical_expr =
+                create_physical_expr(&expr, &df_schema, 
&ExecutionProps::new()).unwrap();
+            let analysis_result = analyze(
+                &physical_expr,
+                AnalysisContext::new(boundaries),
+                df_schema.as_ref(),
+            )
+            .unwrap();
+            let actual = &analysis_result.boundaries[0].interval;
+            let expected = Interval::make(lower, upper).unwrap();
+            assert_eq!(
+                &expected, actual,
+                "did not get correct interval for SQL expression: {expr:?}"
+            );
+        }
+    }
+
+    #[test]
+    fn test_analyze_invalid_boundary_exprs() {
+        let schema = Arc::new(Schema::new(vec![make_field("a", 
DataType::Int32)]));
+        type TestCase = (Expr, &'static str);
+        let test_cases: Vec<TestCase> = vec![(

Review Comment:
   To avoid any misunderstanding, I'd like to elaborate what is in my mind: 
There are 3 types of analyze result:
   1) The analysis result will shrink the bounds (a successful evaluation)
   2) The analysis result will be successful again,  but the bounds could not 
be shrunk ,e.g. (a<0 || a>=0). That will be represented as [None, None], and 
means range of [-∞, ∞].
   3) The analysis results will be [None,None], just like in this example, 
which is an empty-set actually. If you evaluate `a<0 AND a>0` in interval 
arithmetic tools, you will get the result of [false, false] (CERTAINLY_FALSE). 
So, there is nothing wrong there. However, in the scope of AnalysisContext, 
these are mixed up. The solution would be an another Option<> wrapper, 
representing the evaluation comes up with a valid range, or the result is an 
empty-set.
   4) As you mention, there are unsupported types. We are (or should) giving 
internal errors for them.
   
   As you're increasing the test coverage, and there is not a bug which should 
be resolved immediately, we can continue with the current state of this PR. 
However, if you are willing to solve that possible confusion, I can provide 
further recommendation in detail (related with the 2nd and 3rd items).
   
   BTW, What I wanted to emphasize in this thread, you have defined a vector of 
single element
   ```
       let test_cases: Vec<TestCase> = vec![(
            col("a").lt(lit(10)).or(col("a").gt(lit(20))),
                "Interval arithmetic does not support the operator OR",
            )];
   ```
   I just wanted to point out that.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] test: interval analysis unit tests [datafusion]

Reply via email to