isidentical commented on code in PR #3868:
URL: https://github.com/apache/arrow-datafusion/pull/3868#discussion_r999946718


##########
datafusion/physical-expr/src/physical_expr.rs:
##########
@@ -61,6 +62,81 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug {
             Ok(tmp_result)
         }
     }
+    /// Return the expression statistics for this expression. This API is currently experimental.
+    fn expr_stats(&self) -> Arc<dyn PhysicalExprStats> {
+        Arc::new(BasicExpressionStats {})
+    }
+}
+
+/// Statistics about the result of a single expression.
+#[derive(Clone, Debug, PartialEq)]
+pub struct ExprBoundaries {
+    /// Maximum value this expression's result can have.
+    pub max_value: ScalarValue,
+    /// Minimum value this expression's result can have.
+    pub min_value: ScalarValue,
+    /// Maximum number of distinct values this expression can produce.
+    pub distinct_count: Option<usize>,
+    /// Selectivity of this expression if it were used as a predicate.
+    pub selectivity: Option<f64>,
+}
+
+impl ExprBoundaries {
+    /// Create a new `ExprBoundaries`.
+    pub fn new(
+        max_value: ScalarValue,
+        min_value: ScalarValue,
+        distinct_count: Option<usize>,
+    ) -> Self {
+        Self {
+            max_value,
+            min_value,
+            distinct_count,
+            selectivity: None,
+        }
+    }
+
+    /// Try to reduce the expression boundaries to a single value if possible.
+    pub fn reduce(&self) -> Option<ScalarValue> {
+        if self.min_value == self.max_value {
+            Some(self.min_value.clone())
+        } else {
+            None
+        }
+    }
+}
+
+/// A toolkit to work with physical expression statistics. This API is currently
+/// experimental and might be subject to change.
+pub trait PhysicalExprStats: Send + Sync {
+    /// Return an estimate of the boundaries this expression's result would have (in
+    /// terms of the minimum and maximum values it can take, as well as the number of
+    /// unique values it can produce). The inputs are the column-level statistics from
+    /// the current physical plan.
+    fn boundaries(&self, columns: &[ColumnStatistics]) -> Option<ExprBoundaries>;
+
+    /// Apply the given boundaries to this column. Currently only applicable for
+    /// top-level columns.
+    #[allow(unused_variables)]
+    fn update_boundaries(
+        &self,
+        columns: &[ColumnStatistics],
+        boundaries: &ExprBoundaries,
+    ) -> Vec<ColumnStatistics> {
+        // TODO: to support recursive boundary updates, we need per-column expression
+        // boundaries with known ids (either indexes or something like that).
+        columns.to_vec()
+    }

Review Comment:
   This is something I thought would be interesting for composite expressions, where boundaries can change during the processing loop. Imagine the following scenario with the basic expression below:
   
   ```
   a = 5 AND b > a
   ```
   
   The internal processing would look something like this (without `update_boundaries`):
   
   ```py
   original_col_stats = [ColStat {0, 100}, ColStat {0, 100}];
   # a = 5
   lhs_boundaries = lhs.boundaries(&original_col_stats);
   # b > a
   rhs_boundaries = rhs.boundaries(&original_col_stats);
   ```
   
   But it would miss a huge opportunity when processing the RHS, because we know that `a` must be `5` in this **context** (1). Since `col_stats` is the only context-level information we are passing, I thought we might as well leverage (2) it to also carry the updated column information. The code above could then become the following:
   
   ```py
   original_col_stats = [ColStat {0, 100}, ColStat {0, 100}];
   # a = 5
   lhs_boundaries = lhs.boundaries(&original_col_stats);
   # [ColStat {5, 5}, ColStat {0, 100}]
   lhs_col_stats = lhs.col_stats(&original_col_stats);
   # b > a  <================== we now know a can be reduced to Scalar(5)
   rhs_boundaries = rhs.boundaries(&lhs_col_stats)
   ```
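   
   To make the narrowing effect concrete, here is a minimal, self-contained sketch. The types and helpers below (`ColStat` as a plain `i64` range, `apply_eq`, `apply_gt_col`) are hypothetical simplifications of `ColumnStatistics` / `ScalarValue`, not part of this PR:
   
   ```rust
   // Hypothetical, simplified stand-in for ColumnStatistics: an inclusive i64 range.
   #[derive(Clone, Debug, PartialEq)]
   pub struct ColStat { pub min: i64, pub max: i64 }
   
   /// `col = val`: narrows column `col` down to a single value.
   pub fn apply_eq(stats: &[ColStat], col: usize, val: i64) -> Vec<ColStat> {
       let mut out = stats.to_vec();
       out[col] = ColStat { min: val, max: val };
       out
   }
   
   /// `lhs > rhs` (both columns): when `rhs` has already been reduced to a single
   /// known value, `lhs`'s lower bound can be raised past it.
   pub fn apply_gt_col(stats: &[ColStat], lhs: usize, rhs: usize) -> Vec<ColStat> {
       let mut out = stats.to_vec();
       if stats[rhs].min == stats[rhs].max {
           out[lhs].min = out[lhs].min.max(stats[rhs].min + 1);
       }
       out
   }
   
   fn main() {
       let original = vec![ColStat { min: 0, max: 100 }, ColStat { min: 0, max: 100 }];
       let after_lhs = apply_eq(&original, 0, 5);      // a = 5
       let after_rhs = apply_gt_col(&after_lhs, 1, 0); // b > a, with a known to be 5
       assert_eq!(after_rhs[0], ColStat { min: 5, max: 5 });
       assert_eq!(after_rhs[1], ColStat { min: 6, max: 100 });
   }
   ```
   
   Without the propagation step, `b`'s range would stay at `{0, 100}` even though the predicate guarantees `b > 5`.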
   
   ----
   
   1, 2: A much better way of doing this would be to have a real context: we would store all the boundaries and update them as we go, and when we need to fork it (`AND` is only one scenario; we also have to consider `OR` and possibly other expressions / logical constructs), the original would stay the same until replaced by the succeeding fork.
   
   


