[GitHub] [arrow-datafusion] isidentical commented on issue #3898: A framework for expression boundary analysis (and statistics)

GitBox Thu, 20 Oct 2022 12:06:58 -0700


isidentical commented on issue #3898:
URL: 
https://github.com/apache/arrow-datafusion/issues/3898#issuecomment-1286012013


   > I love this idea @isidentical -- thank you for filing it -- 
   
   Thanks @alamb (and for all your feedback during the initial design) ❤️ 
   
   > I am still not quite sure about how well keeping column_boundaries would 
work in practice, but I think I would just have to see how it works in practice 
to find out.
   
   It also bothers me a bit, so maybe we can iterate on it to see if there is a 
simpler solution that can help us to solve the following problem without having 
`apply()` and `column_boundaries`. 
   
   ```rs
   let expr = parse("a >= 20");
   // The context can be constructed from statistics as well.
   let mut context = Context { column_boundaries: [Boundary { min: 1, max: 100 }] };
   let boundaries = expr.analyze(&mut context);
   assert_eq!(context.column_boundaries[0].min, 20);
   ```
   
   If we want the condition above to succeed (considering we now know that 
`a`'s minimum value is `20`, due to `a >= 20`), we need a way of translating 
that information to the column level (so that, if the next expression [in the 
same context] is `a < 30`, we can narrow the range even further, etc.).
   
   The simplest solution I can think of is to check whether the `left` side is 
a column and, if so, update its boundaries 
(`context.column_boundaries[left.index] = Boundary {min: 20, max: left.max}`) 
directly inside the filter selectivity analysis. But what can we do if the 
expression looks something like this: `a + 1 >= 20`?
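To make the limitation concrete, here is a minimal sketch of that direct approach. The `Expr` enum, the `narrow_gt_eq` helper, and the integer-only `Boundary` are assumptions made for illustration; they are not the types from the actual proposal.

```rust
// Hypothetical, simplified types (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Boundary {
    min: i64,
    max: i64,
}

struct Context {
    column_boundaries: Vec<Boundary>,
}

enum Expr {
    Column(usize),
    Literal(i64),
    Add(Box<Expr>, Box<Expr>),
}

/// Direct approach for `left >= value`: narrow the column only when `left`
/// is a bare column. Returns `true` if a column boundary was updated.
fn narrow_gt_eq(left: &Expr, value: i64, context: &mut Context) -> bool {
    match left {
        Expr::Column(index) => {
            let current = context.column_boundaries[*index];
            context.column_boundaries[*index] = Boundary {
                // `max(value)` is an extra guard (an assumption) so that an
                // already tighter minimum is never loosened.
                min: current.min.max(value),
                max: current.max,
            };
            true
        }
        // Anything else (e.g. `a + 1`) is not handled by this approach.
        _ => false,
    }
}

fn direct_approach_demo() -> (bool, bool, Boundary) {
    let mut context = Context {
        column_boundaries: vec![Boundary { min: 1, max: 100 }],
    };
    // `a >= 20`: the left side is a column, so the boundary is narrowed.
    let simple = narrow_gt_eq(&Expr::Column(0), 20, &mut context);
    // `a + 1 >= 25`: the left side is not a column, so nothing is updated.
    let a_plus_one = Expr::Add(Box::new(Expr::Column(0)), Box::new(Expr::Literal(1)));
    let nested = narrow_gt_eq(&a_plus_one, 25, &mut context);
    (simple, nested, context.column_boundaries[0])
}
```

In this sketch the second call leaves the column untouched, even though `a + 1 >= 25` implies `a >= 24`.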
   
   This is where `apply()` comes into play. When the analysis for `a + 1 >= 20` 
is complete, we go through the following steps:
   - (`a + 1 >= 20`'s `analyze`) `left.apply(Boundary {min: 20, max: left.max})`
     - (`a + 1`'s `apply`) `left.apply(Boundary {min: boundaries.min - 1, max: 
boundaries.max - 1})`
       - (`a`'s `apply`) `context.column_boundaries[self.index] = boundaries`
    
   If `left` were a simple column, it would look the same:
   - (`a >= 20`'s `analyze`) `left.apply(Boundary {min: 20, max: left.max})`
     - (`a`'s `apply`) `context.column_boundaries[self.index] = boundaries`
   
   And if any of the expressions in the next conjunction (or actually anything 
that shares the same context) references `a`, it will simply pick up the 
updated boundaries, with no extra work needed.
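The `analyze()` / `apply()` flow described above could be sketched roughly as follows. This is only an illustration under several assumptions: the `Expr` variants, the integer-only `Boundary`, and the choice to have a comparison's `analyze` return the narrowed boundary of its left side are all hypothetical simplifications, not the proposed API.

```rust
// Hypothetical, simplified types (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Boundary {
    min: i64,
    max: i64,
}

struct Context {
    column_boundaries: Vec<Boundary>,
}

enum Expr {
    Column(usize),
    AddLiteral(Box<Expr>, i64), // expr + literal
    GtEq(Box<Expr>, i64),       // expr >= literal
    Lt(Box<Expr>, i64),         // expr < literal (integers assumed)
}

impl Expr {
    /// Computes boundaries for this expression. Comparisons also push the
    /// narrowed range back down to the columns via `apply`.
    fn analyze(&self, context: &mut Context) -> Boundary {
        match self {
            Expr::Column(index) => context.column_boundaries[*index],
            Expr::AddLiteral(inner, value) => {
                let b = inner.analyze(context);
                Boundary { min: b.min + *value, max: b.max + *value }
            }
            Expr::GtEq(left, value) => {
                let b = left.analyze(context);
                let narrowed = Boundary { min: b.min.max(*value), max: b.max };
                left.apply(narrowed, context);
                narrowed // simplification: returns the left side's new range
            }
            Expr::Lt(left, value) => {
                let b = left.analyze(context);
                // For integers, `x < v` means `x <= v - 1`.
                let narrowed = Boundary { min: b.min, max: b.max.min(*value - 1) };
                left.apply(narrowed, context);
                narrowed
            }
        }
    }

    /// Translates boundaries known for this expression back to its inputs.
    fn apply(&self, boundaries: Boundary, context: &mut Context) {
        match self {
            Expr::Column(index) => context.column_boundaries[*index] = boundaries,
            // If `inner + value` lies in [min, max], then `inner` lies in
            // [min - value, max - value].
            Expr::AddLiteral(inner, value) => inner.apply(
                Boundary { min: boundaries.min - *value, max: boundaries.max - *value },
                context,
            ),
            // Comparisons have nothing to propagate in this sketch.
            Expr::GtEq(..) | Expr::Lt(..) => {}
        }
    }
}

fn narrowing_demo() -> (Boundary, Boundary) {
    let mut context = Context {
        column_boundaries: vec![Boundary { min: 1, max: 100 }],
    };
    // First conjunct: `a + 1 >= 20`.
    let first = Expr::GtEq(Box::new(Expr::AddLiteral(Box::new(Expr::Column(0)), 1)), 20);
    first.analyze(&mut context);
    let after_first = context.column_boundaries[0];
    // Second conjunct: `a < 30`, analyzed in the same context.
    let second = Expr::Lt(Box::new(Expr::Column(0)), 30);
    second.analyze(&mut context);
    (after_first, context.column_boundaries[0])
}
```

With a starting range of `[1, 100]` for `a`, the first conjunct narrows the column to `[19, 100]`, and the second one, which only reads the shared context, narrows it further to `[19, 29]`.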


