edmondop commented on code in PR #14699:
URL: https://github.com/apache/datafusion/pull/14699#discussion_r1963629140
##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -405,13 +406,18 @@ impl Interval {
         // There must be no way to create an interval whose endpoints have
         // different types.
-        assert!(
+        debug_assert!(

Review Comment:
   What's the rationale in moving from an assert to a debug assert? Can we do the validation in the constructor here https://github.com/apache/datafusion/pull/14699/files#diff-2a1941b7d923645151db0a34095b2d1dff5671eaa1103ec6111d0de21c31f5c3L293 and remove the assert?


##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -645,34 +651,78 @@ impl Interval {
         let upper = min_of_bounds(&self.upper, &rhs.upper);

         // New lower and upper bounds must always construct a valid interval.
-        assert!(
+        debug_assert!(
             (lower.is_null() || upper.is_null() || (lower <= upper)),
             "The intersection of two intervals can not be an invalid interval"
         );

         Ok(Some(Self { lower, upper }))
     }

-    /// Decide if this interval certainly contains, possibly contains, or can't
-    /// contain a [`ScalarValue`] (`other`) by returning `[true, true]`,
-    /// `[false, true]` or `[false, false]` respectively.
+    /// Compute the union of this interval with the given interval.
     ///
     /// NOTE: This function only works with intervals of the same data type.
     ///       Attempting to compare intervals of different data types will lead
     ///       to an error.
-    pub fn contains_value<T: Borrow<ScalarValue>>(&self, other: T) -> Result<bool> {
+    pub fn union<T: Borrow<Self>>(&self, other: T) -> Result<Self> {
         let rhs = other.borrow();
         if self.data_type().ne(&rhs.data_type()) {
+            return internal_err!(
+                "Cannot calculate the union of intervals with different data types, lhs:{}, rhs:{}",
+                self.data_type(),
+                rhs.data_type()
+            );
+        };
+
+        let lower = if self.lower.is_null()
+            || (!rhs.lower.is_null() && self.lower <= rhs.lower)
+        {
+            self.lower.clone()
+        } else {
+            rhs.lower.clone()
+        };
+        let upper = if self.upper.is_null()
+            || (!rhs.upper.is_null() && self.upper >= rhs.upper)
+        {
+            self.upper.clone()
+        } else {
+            rhs.upper.clone()
+        };
+
+        // New lower and upper bounds must always construct a valid interval.
+        debug_assert!(

Review Comment:
   In which scenario does this happen?


##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -1119,11 +1180,11 @@ fn next_value_helper<const INC: bool>(value: ScalarValue) -> ScalarValue {
     match value {
         // f32/f64::NEG_INF/INF and f32/f64::NaN values should not emerge at this point.
         Float32(Some(val)) => {
-            assert!(val.is_finite(), "Non-standardized floating point usage");
+            debug_assert!(val.is_finite(), "Non-standardized floating point usage");

Review Comment:
   This will strip the assert away from production builds. Is this intended?
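
For context on the assert/debug_assert questions above, here is a small standalone Rust sketch (not DataFusion code; `check` is a hypothetical function) of how the two macros differ across build profiles:

```rust
// Standalone sketch of the assert!/debug_assert! difference discussed above.
// `check` is a hypothetical example function, not part of DataFusion.
fn check(val: f64) {
    // Evaluated in every build profile; panics on failure even in release.
    assert!(val.is_finite(), "Non-standardized floating point usage");

    // Compiled out when debug assertions are disabled (the default for
    // `cargo build --release`), so release binaries skip this check entirely.
    debug_assert!(val.is_finite(), "Non-standardized floating point usage");
}

fn main() {
    check(1.0);
    // `check(f64::NAN)` would always panic at the `assert!` above; with only
    // the `debug_assert!`, it would panic in debug builds but not in release.
}
```

In short, `assert!` keeps the check in release binaries, while `debug_assert!` trades that safety net for avoiding the check's cost in optimized builds.
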
##########
datafusion/physical-expr-common/src/physical_expr.rs:
##########
@@ -144,6 +153,111 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + DynEq + DynHash {
         Ok(Some(vec![]))
     }

+    /// Computes the output statistics for the expression, given the input
+    /// statistics.
+    ///
+    /// # Parameters
+    ///
+    /// * `children` are the statistics for the children (inputs) of this
+    ///   expression.
+    ///
+    /// # Returns
+    ///
+    /// A `Result` containing the output statistics for the expression in
+    /// case of success, or an error object in case of failure.
+    ///
+    /// Expressions (should) implement this function and utilize the independence
+    /// assumption, match on children distribution types and compute the output
+    /// statistics accordingly. The default implementation simply creates an
+    /// unknown output distribution by combining input ranges. This logic loses
+    /// distribution information, but is a safe default.
+    fn evaluate_statistics(&self, children: &[&StatisticsV2]) -> Result<StatisticsV2> {
+        let children_ranges = children
+            .iter()
+            .map(|c| c.range())
+            .collect::<Result<Vec<_>>>()?;
+        let children_ranges_refs = children_ranges.iter().collect::<Vec<_>>();
+        let output_interval = self.evaluate_bounds(children_ranges_refs.as_slice())?;
+        let dt = output_interval.data_type();
+        if dt.eq(&DataType::Boolean) {
+            let p = if output_interval.eq(&Interval::CERTAINLY_TRUE) {
+                ScalarValue::new_one(&dt)
+            } else if output_interval.eq(&Interval::CERTAINLY_FALSE) {
+                ScalarValue::new_zero(&dt)
+            } else {
+                ScalarValue::try_from(&dt)
+            }?;
+            StatisticsV2::new_bernoulli(p)
+        } else {
+            StatisticsV2::new_from_interval(output_interval)
+        }
+    }
+
+    /// Updates children statistics using the given parent statistic for this
+    /// expression.
+    ///
+    /// This is used to propagate statistics down through an expression tree.
+    ///
+    /// # Parameters
+    ///
+    /// * `parent` is the currently known statistics for this expression.
+    /// * `children` are the current statistics for the children of this expression.
+    ///
+    /// # Returns
+    ///
+    /// A `Result` containing a `Vec` of new statistics for the children (in order)
+    /// in case of success, or an error object in case of failure.
+    ///
+    /// If statistics propagation reveals an infeasibility for any child, returns

Review Comment:
   I wonder if this doesn't call for a specific enum that is more explicit, something like
   ```rust
   enum StatisticsPropagation {
       Unfeasible,
       Feasible(Vec<StatisticsV2>),
   }
   ```
   I personally suffer a bit when I see `Option<Collection>`, and I always wonder what the difference is between a `None` and an empty collection.


--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org