[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: Expression boundary analysis framework

GitBox Thu, 03 Nov 2022 12:46:00 -0700


isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013331395



##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
             children[1].clone(),
         )))
     }
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
-    fn eq(&self, other: &dyn Any) -> bool {
-        down_cast_any_ref(other)
-            .downcast_ref::<Self>()
-            .map(|x| self.left.eq(&x.left) && self.op == x.op && 
self.right.eq(&x.right))
-            .unwrap_or(false)
-    }
-}
-
-struct BinaryExprStats {
-    op: Operator,
-    left: Arc<dyn PhysicalExpr>,
-    right: Arc<dyn PhysicalExpr>,
-}
 
-impl PhysicalExprStats for BinaryExprStats {
-    fn boundaries(&self, columns: &[ColumnStatistics]) -> 
Option<ExprBoundaries> {
+    /// Return the boundaries of this binary expression's result. If the 
expression itself
+    /// is a comparison which changes the boundaries of one of its inputs (a = 
20 would pin
+    /// a to 20), then it might update the input's boundaries directly on the 
context.
+    fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries> 
{

Review Comment:
   > Since AnalysisContext is in terms of the "current schema" it will likely 
almost always end up being modified if the execution plan modifies the schema 
(which most do).
   
   You can think about as if it were a Statistics for that execution plan. It 
will be generated by the same place and will be constructed from the existing 
statistics (at least in `FilterExec`).
   
   > Since an AnalysisContext has ExprBoundaries in it anyways, I recommend the 
signature be
   > fn analyze(&self) -> AnalysisContext {
   
   If I understood it correctly, the problem with this approach is that it 
won't be able to see the existing boundaries of any columns. The place that 
wants to do the analysis needs to pass column information (otherwise it is not 
going to be a very useful analysis; unless the expression itself is 
self-sufficient `1 > 2`). All the children also need to receive a context, but 
depending on the parent (`OR` for example) it might be a different one than the 
one we received initially.
   
   > As I have mentioned before, I don't fully understand specially how 
column_boundaries can be used, but I am interested in seeing an example
   
   The only 'unknown' during the analysis process is the column references. 
Anything else can be inferred by the behaviour of the particular expression 
itself. So we need a way of conveying that information to each sub-expression 
when doing the analysis. I initially started using a vector of column 
statistics but then once we had to change it on certain expressions, like for 
example `a > 20 AND a < 70` where we have different boundaries for `a` when 
evaluating the second part of the conjunction (`a < 70`), I went with a 
mutable/shared context.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] isidentical commented on a diff in pull request #3912: Expression boundary analysis framework

Reply via email to