alamb commented on code in PR #14699:
URL: https://github.com/apache/datafusion/pull/14699#discussion_r1965417121


##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -1119,11 +1180,11 @@ fn next_value_helper<const INC: bool>(value: 
ScalarValue) -> ScalarValue {
     match value {
         // f32/f64::NEG_INF/INF and f32/f64::NaN values should not emerge at 
this point.
         Float32(Some(val)) => {
-            assert!(val.is_finite(), "Non-standardized floating point usage");
+            debug_assert!(val.is_finite(), "Non-standardized floating point 
usage");

Review Comment:
   right, so I think the question is still why not leave the check in always 
(even in production code)? I don't have a strong preference FWIW
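   For context, a minimal sketch of the trade-off being discussed (the `next_value_checked` helper is hypothetical, not the PR's code): `assert!` fires in both debug and release builds, while `debug_assert!` is compiled out when `debug_assertions` is off, so the check silently disappears in production.

   ```rust
   // Sketch: `assert!` always panics on failure; `debug_assert!` is compiled
   // out in release builds, so a non-finite input would slip through there.
   fn next_value_checked(val: f32) -> f32 {
       // Always-on check: catches non-finite inputs in production builds too.
       assert!(val.is_finite(), "Non-standardized floating point usage");
       val + f32::EPSILON
   }

   fn main() {
       let v = next_value_checked(1.0);
       assert!(v > 1.0);
   }
   ```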



##########
datafusion/physical-expr-common/src/physical_expr.rs:
##########
@@ -144,6 +153,111 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + 
DynEq + DynHash {
         Ok(Some(vec![]))
     }
 
+    /// Computes the output statistics for the expression, given the input
+    /// statistics.
+    ///
+    /// # Parameters
+    ///
+    /// * `children` are the statistics for the children (inputs) of this
+    ///   expression.
+    ///
+    /// # Returns
+    ///
+    /// A `Result` containing the output statistics for the expression in
+    /// case of success, or an error object in case of failure.
+    ///
+    /// Expressions (should) implement this function and utilize the 
independence
+    /// assumption, match on children distribution types and compute the output
+    /// statistics accordingly. The default implementation simply creates an
+    /// unknown output distribution by combining input ranges. This logic loses
+    /// distribution information, but is a safe default.
+    fn evaluate_statistics(&self, children: &[&StatisticsV2]) -> 
Result<StatisticsV2> {
+        let children_ranges = children
+            .iter()
+            .map(|c| c.range())
+            .collect::<Result<Vec<_>>>()?;
+        let children_ranges_refs = children_ranges.iter().collect::<Vec<_>>();
+        let output_interval = 
self.evaluate_bounds(children_ranges_refs.as_slice())?;
+        let dt = output_interval.data_type();
+        if dt.eq(&DataType::Boolean) {
+            let p = if output_interval.eq(&Interval::CERTAINLY_TRUE) {
+                ScalarValue::new_one(&dt)
+            } else if output_interval.eq(&Interval::CERTAINLY_FALSE) {
+                ScalarValue::new_zero(&dt)
+            } else {
+                ScalarValue::try_from(&dt)
+            }?;
+            StatisticsV2::new_bernoulli(p)

Review Comment:
   I don't understand why this would assume something about the distribution of 
the values (as in, why does it assume a boolean variable has a Bernoulli 
distribution 🤔 )
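   One way to look at it (my reading, not necessarily the PR author's intent): a boolean column is fully characterized by a single parameter p = P(value = true), which is exactly a Bernoulli distribution, so for booleans it is the fully general description rather than an assumption. A tiny sketch with a hypothetical `bernoulli_p` helper:

   ```rust
   // Sketch (hypothetical helper, not part of the PR): any boolean column is
   // fully described by p = P(value = true), i.e. a Bernoulli distribution.
   // Assumes a non-empty input slice.
   fn bernoulli_p(values: &[bool]) -> f64 {
       let trues = values.iter().filter(|&&b| b).count();
       trues as f64 / values.len() as f64
   }

   fn main() {
       let p = bernoulli_p(&[true, true, false, true]);
       assert!((p - 0.75).abs() < 1e-12);
   }
   ```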



##########
datafusion/physical-expr-common/src/physical_expr.rs:
##########
@@ -144,6 +153,111 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + 
DynEq + DynHash {
         Ok(Some(vec![]))
     }
 
+    /// Computes the output statistics for the expression, given the input
+    /// statistics.
+    ///
+    /// # Parameters
+    ///
+    /// * `children` are the statistics for the children (inputs) of this
+    ///   expression.
+    ///
+    /// # Returns
+    ///
+    /// A `Result` containing the output statistics for the expression in
+    /// case of success, or an error object in case of failure.
+    ///
+    /// Expressions (should) implement this function and utilize the 
independence
+    /// assumption, match on children distribution types and compute the output
+    /// statistics accordingly. The default implementation simply creates an
+    /// unknown output distribution by combining input ranges. This logic loses
+    /// distribution information, but is a safe default.
+    fn evaluate_statistics(&self, children: &[&StatisticsV2]) -> 
Result<StatisticsV2> {

Review Comment:
   This is *very* cool -- I love this as a building block
   
   One suggestion in terms of API design: `&[&StatisticsV2]` pretty much 
requires using `Vec`s.
   
   I recommend adding some structure like `TableStatisticsV2` or 
`RelationStatisticsV2` that encapsulates the notion of a collection. Something 
like:
   
   ```rust
   struct RelationStatisticsV2 {
   ...
   }
   
   impl RelationStatisticsV2 {
       /// Return statistics for column `idx`
       fn column(&self, idx: usize) -> &StatisticsV2 { ... }
   }
   ```
   That would make it easier to avoid copying / change underlying 
representations



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -0,0 +1,1610 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::f64::consts::LN_2;
+
+use crate::interval_arithmetic::{apply_operator, Interval};
+use crate::operator::Operator;
+use crate::type_coercion::binary::binary_numeric_coercion;
+
+use arrow::array::ArrowNativeTypeOp;
+use arrow::datatypes::DataType;
+use datafusion_common::rounding::alter_fp_rounding_mode;
+use datafusion_common::{internal_err, not_impl_err, Result, ScalarValue};
+
+/// New, enhanced `Statistics` definition, represents five core statistical

Review Comment:
   I think the challenge of the above is to figure out how the API looks to 
compute the distributions for different physical exprs (as the calculation is 
going to be different for different types of input distributions 🤔 )



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -0,0 +1,1610 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::f64::consts::LN_2;
+
+use crate::interval_arithmetic::{apply_operator, Interval};
+use crate::operator::Operator;
+use crate::type_coercion::binary::binary_numeric_coercion;
+
+use arrow::array::ArrowNativeTypeOp;
+use arrow::datatypes::DataType;
+use datafusion_common::rounding::alter_fp_rounding_mode;
+use datafusion_common::{internal_err, not_impl_err, Result, ScalarValue};
+
+/// New, enhanced `Statistics` definition, represents five core statistical

Review Comment:
   So I guess in my mind I see the following challenges:
   1. I am not sure about the practical use of several of these distributions
   2. There doesn't seem to be an easy (i.e., not requiring changes to 
DataFusion's code) way to add other methods of statistics calculation
   
   Instead of an enum-based approach, what would you think about a trait-based 
one? This would allow users to encode arbitrary information about their 
distributions without changes to the core.
   
   Something like:
   
   ```rust
   /// Describes how data is distributed across a column
   pub trait Distribution {
       /// Return the mean of this distribution
       fn mean(&self) -> Result<ScalarValue>;
       /// Return the range of this distribution
       fn range(&self) -> Result<Interval>;
       /// Return the data type of this distribution
       fn data_type(&self) -> DataType;
   ...
   }
   
   /// DataFusion provides some built-in distributions
   impl Distribution for UnknownDistribution {
   ...
   }
   
   impl Distribution for UniformDistribution {
   ...
   }
   ...
   ```
   
   
   



##########
datafusion/expr-common/src/interval_arithmetic.rs:
##########
@@ -645,34 +651,78 @@ impl Interval {
         let upper = min_of_bounds(&self.upper, &rhs.upper);
 
         // New lower and upper bounds must always construct a valid interval.
-        assert!(
+        debug_assert!(
             (lower.is_null() || upper.is_null() || (lower <= upper)),
             "The intersection of two intervals can not be an invalid interval"
         );
 
         Ok(Some(Self { lower, upper }))
     }
 
-    /// Decide if this interval certainly contains, possibly contains, or can't
-    /// contain a [`ScalarValue`] (`other`) by returning `[true, true]`,
-    /// `[false, true]` or `[false, false]` respectively.
+    /// Compute the union of this interval with the given interval.
     ///
     /// NOTE: This function only works with intervals of the same data type.
     ///       Attempting to compare intervals of different data types will lead
     ///       to an error.
-    pub fn contains_value<T: Borrow<ScalarValue>>(&self, other: T) -> 
Result<bool> {
+    pub fn union<T: Borrow<Self>>(&self, other: T) -> Result<Self> {

Review Comment:
   FWIW it might be more efficient to take self and other by value here (not 
reference) -- the API as is will force values to be cloned, even if self/other 
are not used elsewhere
   
   If it was something like 
   ```rust
   pub fn union(self, other: Self) -> Result<Self> {
   ...
   ```
   
   It could potentially reuse the allocation
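   To make the shape concrete, a compilable sketch of the by-value API using a simplified interval over `i64` (the real `Interval` wraps `ScalarValue`s; `SimpleInterval` is made up for illustration). Taking `self` by value lets the method move the bounds instead of forcing a clone:

   ```rust
   // Simplified stand-in for `Interval`, just to show the ownership shape.
   #[derive(Debug, PartialEq)]
   struct SimpleInterval {
       lower: i64,
       upper: i64,
   }

   impl SimpleInterval {
       // By-value union: consumes both operands, so their storage can be
       // reused rather than cloned.
       fn union(self, other: Self) -> Self {
           SimpleInterval {
               lower: self.lower.min(other.lower),
               upper: self.upper.max(other.upper),
           }
       }
   }

   fn main() {
       let a = SimpleInterval { lower: 0, upper: 5 };
       let b = SimpleInterval { lower: 3, upper: 9 };
       assert_eq!(a.union(b), SimpleInterval { lower: 0, upper: 9 });
   }
   ```

   Callers that do need to keep the originals can clone explicitly at the call site, which makes the cost visible instead of hiding it inside the method.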



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -0,0 +1,1610 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+use std::f64::consts::LN_2;
+
+use crate::interval_arithmetic::{apply_operator, Interval};
+use crate::operator::Operator;
+use crate::type_coercion::binary::binary_numeric_coercion;
+
+use arrow::array::ArrowNativeTypeOp;
+use arrow::datatypes::DataType;
+use datafusion_common::rounding::alter_fp_rounding_mode;
+use datafusion_common::{internal_err, not_impl_err, Result, ScalarValue};
+
+/// New, enhanced `Statistics` definition, represents five core statistical

Review Comment:
   While these 5 distributions are very cool, I am not sure I have ever run 
into them being used in a practical database system (as real world data often 
doesn't neatly follow any existing distribution) 
   
   As I understand it, statistics estimation is typically done via:
   1. Assume a uniform distribution (often not particularly accurate, but very 
simple to implement and reason about)
   2. Use equi-height histograms measured across the data
   3. Some sort of sketch for distinct values and correlation between columns
   
   That being said, I am not a statistics expert
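   For illustration, option 2 can be sketched like this (`equi_height_buckets` is a made-up helper, not DataFusion code): each bucket covers roughly the same number of rows, so the selectivity of a range predicate can be estimated by counting the buckets it covers.

   ```rust
   // Sketch: build an equi-height histogram as (bucket_min, bucket_max) pairs,
   // where every bucket holds roughly the same number of rows.
   // Assumes a non-empty input.
   fn equi_height_buckets(mut values: Vec<i64>, num_buckets: usize) -> Vec<(i64, i64)> {
       values.sort_unstable();
       // Ceiling division so all rows are assigned to some bucket.
       let per_bucket = (values.len() + num_buckets - 1) / num_buckets;
       values
           .chunks(per_bucket)
           .map(|c| (c[0], *c.last().unwrap()))
           .collect()
   }

   fn main() {
       let buckets = equi_height_buckets((1..=8).collect(), 4);
       assert_eq!(buckets, vec![(1, 2), (3, 4), (5, 6), (7, 8)]);
   }
   ```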
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

