Re: [PR] StatisticsV2: initial statistics framework redesign [datafusion]

via GitHub Sat, 22 Feb 2025 04:09:53 -0800


alamb commented on code in PR #14699:
URL: https://github.com/apache/datafusion/pull/14699#discussion_r1966503758



##########
datafusion/physical-expr-common/src/physical_expr.rs:
##########
@@ -144,6 +153,111 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + 
DynEq + DynHash {
         Ok(Some(vec![]))
     }
 
+    /// Computes the output statistics for the expression, given the input
+    /// statistics.
+    ///
+    /// # Parameters
+    ///
+    /// * `children` are the statistics for the children (inputs) of this
+    ///   expression.
+    ///
+    /// # Returns
+    ///
+    /// A `Result` containing the output statistics for the expression in
+    /// case of success, or an error object in case of failure.
+    ///
+    /// Expressions (should) implement this function and utilize the 
independence
+    /// assumption, match on children distribution types and compute the output
+    /// statistics accordingly. The default implementation simply creates an
+    /// unknown output distribution by combining input ranges. This logic loses
+    /// distribution information, but is a safe default.
+    fn evaluate_statistics(&self, children: &[&StatisticsV2]) -> 
Result<StatisticsV2> {
+        let children_ranges = children
+            .iter()
+            .map(|c| c.range())
+            .collect::<Result<Vec<_>>>()?;
+        let children_ranges_refs = children_ranges.iter().collect::<Vec<_>>();
+        let output_interval = 
self.evaluate_bounds(children_ranges_refs.as_slice())?;
+        let dt = output_interval.data_type();
+        if dt.eq(&DataType::Boolean) {
+            let p = if output_interval.eq(&Interval::CERTAINLY_TRUE) {
+                ScalarValue::new_one(&dt)
+            } else if output_interval.eq(&Interval::CERTAINLY_FALSE) {
+                ScalarValue::new_zero(&dt)
+            } else {
+                ScalarValue::try_from(&dt)
+            }?;
+            StatisticsV2::new_bernoulli(p)

Review Comment:
   I think a Bernoulli described the distribution of expected outcomes of a 
binary random variable, rather than the distribution of values within an 
(existing) population. 
   
   In the context of database systems, I don't think it is common to model the 
distribution of values in a column as though they were the output of a random 
variable. 
   
   I would expect that the output distribution of a boolean expression to be 
something like
   1. Uniform (all values equally likely)
   2. Skewed (e.g. 25% values expected to be true, 50% values expected to be 
false, 25% values expected to be NULL)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] StatisticsV2: initial statistics framework redesign [datafusion]

Reply via email to