Re: [PR] Support bounds evaluation for temporal data types [datafusion]

via GitHub Thu, 17 Apr 2025 10:23:04 -0700


berkaysynnada commented on code in PR #14523:
URL: https://github.com/apache/datafusion/pull/14523#discussion_r2049071746



##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -963,6 +961,23 @@ pub fn apply_operator(op: &Operator, lhs: &Interval, rhs: 
&Interval) -> Result<I
         Operator::Minus => lhs.sub(rhs),
         Operator::Multiply => lhs.mul(rhs),
         Operator::Divide => lhs.div(rhs),
+        Operator::IsDistinctFrom | Operator::IsNotDistinctFrom => {

Review Comment:
   can apply_operator() return NullableInterval::Null or 
NullableInterval::MaybeNull here? isn't it strange for isDistinct and 
isNotDistinct?



##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -1690,6 +1705,24 @@ impl Display for NullableInterval {
     }
 }
 
+impl From<&Interval> for NullableInterval {

Review Comment:
   This conversion is dangerous. Null means different things in Interval and 
NullableInterval. Interval's can be converted to NotNull types only in the 
current behavior. Where do we need such a conversion, and why? can you 
elaborate a bit



##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -631,13 +623,19 @@ impl Interval {
     ///       to an error.
     pub fn intersect<T: Borrow<Self>>(&self, other: T) -> Result<Option<Self>> 
{
         let rhs = other.borrow();
+
         if self.data_type().ne(&rhs.data_type()) {
-            return internal_err!(
-                "Only intervals with the same data type are intersectable, 
lhs:{}, rhs:{}",
-                self.data_type(),
-                rhs.data_type()
-            );
-        };
+            BinaryTypeCoercer::new(&self.data_type(), &Operator::Plus, 
&rhs.data_type()).get_result_type()

Review Comment:
   should we check the result type is also supported by evaluate_bounds? WDYT



##########
datafusion/physical-expr-common/src/physical_expr.rs:
##########
@@ -126,6 +126,25 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + 
DynEq + DynHash {
         not_impl_err!("Not implemented for {self}")
     }
 
+    /// Checks support of bounds evaluation for this expression, before 
evaluating bounds.
+    /// Returns None if bounds evaluation is not supported.
+    fn evaluate_bounds_checked(

Review Comment:
   We have supports_bounds_evaluation() there already. I think no need to 
introduce another API in PhysicalExpr. If we see lots of repetition of "first 
supports_bounds_evaluation, then evaluate_bounds" pattern, then we can make a 
standalone function, IMO



##########
datafusion/physical-expr/src/analysis.rs:
##########
@@ -94,17 +98,10 @@ pub struct ExprBoundaries {
 impl ExprBoundaries {
     /// Create a new `ExprBoundaries` object from column level statistics.
     pub fn try_from_column(
-        schema: &Schema,
+        field: &Field,
         col_stats: &ColumnStatistics,
         col_index: usize,
     ) -> Result<Self> {
-        let field = schema.fields().get(col_index).ok_or_else(|| {

Review Comment:
   Why do we spread this check over all caller side code?



##########
datafusion/physical-expr/src/expressions/in_list.rs:
##########
@@ -414,6 +415,51 @@ impl PhysicalExpr for InListExpr {
         }
         write!(f, ")")
     }
+
+    /// The output interval is computed by checking if the list item intervals 
are
+    /// a subset of, overlap, or are disjoint with the input expression's 
interval.
+    ///
+    /// If [InListExpr::negated] is true, the output interval gets negated.
+    ///
+    /// # Example:
+    /// If the input expression's interval is a superset of the
+    /// conjunction of the list items intervals, the output
+    /// interval is [`Interval::CERTAINLY_TRUE`].
+    ///
+    /// ```text
+    /// interval of expr:   ....---------------------....
+    /// Some list items:    ..........|..|.....|.|.......
+    ///
+    /// output interval:    [`true`, `true`]
+    /// ```
+    fn evaluate_bounds(&self, children: &[&Interval]) -> Result<Interval> {

Review Comment:
   My previous concern still applies. I think there should be 3 different paths 
of this logic:
   
   1) expr_bounds is singleton:
   check the inList expression **separately**. If any of them is also 
singleton, and has the same value with expr_bounds, then the result is 
CertainlyTrue
   2) expr_bounds is not a singleton => intersect expr_bounds with all inList 
expressions.
   a.  if any of inList expressions intersect with expr_bounds, then the result 
is Uncertain
   b. if any of inList does not intersect with expr_bounds, then the result is 
CertainlyFalse



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [PR] Support bounds evaluation for temporal data types [datafusion]

Reply via email to