isidentical commented on code in PR #3912:
URL: https://github.com/apache/arrow-datafusion/pull/3912#discussion_r1013331395
##########
datafusion/physical-expr/src/expressions/binary.rs:
##########
@@ -665,87 +656,81 @@ impl PhysicalExpr for BinaryExpr {
children[1].clone(),
)))
}
-}
-
-impl PartialEq<dyn Any> for BinaryExpr {
- fn eq(&self, other: &dyn Any) -> bool {
- down_cast_any_ref(other)
- .downcast_ref::<Self>()
- .map(|x| self.left.eq(&x.left) && self.op == x.op &&
self.right.eq(&x.right))
- .unwrap_or(false)
- }
-}
-
-struct BinaryExprStats {
- op: Operator,
- left: Arc<dyn PhysicalExpr>,
- right: Arc<dyn PhysicalExpr>,
-}
-impl PhysicalExprStats for BinaryExprStats {
- fn boundaries(&self, columns: &[ColumnStatistics]) ->
Option<ExprBoundaries> {
+ /// Return the boundaries of this binary expression's result. If the
expression itself
+ /// is a comparison which changes the boundaries of one of its inputs (a =
20 would pin
+ /// a to 20), then it might update the input's boundaries directly on the
context.
+ fn analyze(&self, context: &mut AnalysisContext) -> Option<ExprBoundaries>
{
Review Comment:
> Since AnalysisContext is in terms of the "current schema" it will likely
almost always end up being modified if the execution plan modifies the schema
(which most do).
You can think about as if it were a Statistics for that execution plan. It
will be generated by the same place and will be constructed from the existing
statistics (at least in `FilterExec`).
> Since an AnalysisContext has ExprBoundaries in it anyways, I recommend the
signature be
> fn analyze(&self) -> AnalysisContext {
If I understood it correctly, the problem with this approach is that it
won't be able to see the existing boundaries of any columns. The place that
wants to do the analysis needs to pass column information (otherwise it is not
going to be a very useful analysis; unless the expression itself is
self-sufficient `1 > 2`). All the children also need to receive a context, but
depending on the parent (`OR` for example) it might be a different one than the
one we received initially.
> As I have mentioned before, I don't fully understand specially how
column_boundaries can be used, but I am interested in seeing an example
The only 'unknown' during the analysis process is the column references.
Anything else can be inferred by the behaviour of the particular expression
itself. So we need a way of conveying that information to each sub-expression
when doing the analysis. I initially started using a vector of column
statistics but then once we had to change it on certain expressions, like for
example `a > 20 AND a < 70` where we have different boundaries for `a` when
evaluating the second part of the conjunction (`a < 70`), I went with a
mutable/shared context.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]