isidentical opened a new pull request, #3868:
URL: https://github.com/apache/arrow-datafusion/pull/3868

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   Part of #3845.
   
   # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in 
the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your 
changes and offer better suggestions for fixes.  
   -->
   DataFusion has some cost-based optimizer rules (that operate on the physical plan) where statistics are leveraged to pick a new version of the plan with a lower estimated cost. One great example is the side swapping on hash joins, where we try to make the build side as small as possible. These rules work reasonably well when there is no interference between the joins and the table provider (since joins can now estimate their output cardinality), but one common thing that can cut this flow short is a filter node, which currently does not support producing statistics (as shown in @Dandandan's example in #3845).
   
   # What changes are included in this PR?
   
   This PR **does not** implement the statistics propagation from filters, but I can also include that here. The main reasoning was that, since this part is very self-contained (the new expression statistics API and the filter selectivity analysis), it could land separately and the filter statistics propagation could then be built on top of it. If it makes sense to include it here as well, please let me know (or, if this PR is already too big by itself, I can also split it further, depending on what is easier to review!)
   
   #### Expression Statistics
   
   Since we needed a way of propagating statistics at the expression level 
(rather than the plan level), we now have a new API that is more suitable for 
it. Instead of dealing with individual columns, this API deals with boundaries 
of expression results. For the initial work, the API is implemented for the 
following expressions:
   
   - Literals, where the `min`/`max` boundary is the value the literal holds and the `distinct_count` is `1`.
   - Columns, where the boundary is the same as the boundary from the column statistics (which are passed from the plan into the expressions when using this API).
   - Comparisons (`=`, `>`, `>=`, `<`, `<=`) with one column and one literal (or, more precisely, any expression whose boundary can be reduced to a single scalar). Using the filter selectivity analysis (also implemented in this PR) to estimate the `min`/`max` values, we can reason about the expression's lower and upper bounds as well as its selectivity. A minimal sketch of such a boundary record follows this list.
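
   To make the shape of this concrete, here is a minimal sketch of the kind of per-expression boundary record described above. The type and field names (`ExprBoundaries`, `from_literal`, and so on) are illustrative stand-ins, not necessarily the exact API added by this PR:

```rust
use datafusion_common::ScalarValue;

/// Boundary information tracked per expression rather than per column.
#[derive(Debug, Clone)]
pub struct ExprBoundaries {
    pub min_value: ScalarValue,
    pub max_value: ScalarValue,
    /// Number of distinct values the expression can produce, if known.
    pub distinct_count: Option<usize>,
    /// Fraction of the input rows the expression is expected to select, if known.
    pub selectivity: Option<f64>,
}

impl ExprBoundaries {
    /// A literal has a degenerate boundary: min and max are the value itself,
    /// and it can only ever produce one distinct value.
    pub fn from_literal(value: ScalarValue) -> Self {
        Self {
            min_value: value.clone(),
            max_value: value,
            distinct_count: Some(1),
            selectivity: None,
        }
    }
}

fn main() {
    let bounds = ExprBoundaries::from_literal(ScalarValue::Int64(Some(42)));
    assert_eq!(bounds.distinct_count, Some(1));
}
```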
   
   There is also a new API for updating the statistics of a column with new, known boundaries. It is currently not used anywhere (although it is implemented and tested), but its primary purpose will be tightening the known maximums and minimums once we have support for composite predicates (e.g. in `A = 50 AND B > A`, we can now see that `A` is actually `50`, at least in that context).
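
   As a rough illustration of what that update could look like, here is a sketch using a simplified `ColumnBounds` stand-in with plain `i64` bounds rather than the real column statistics types:

```rust
/// Simplified stand-in for a column's known [min, max] range.
#[derive(Debug, Clone, PartialEq)]
struct ColumnBounds {
    min: i64,
    max: i64,
}

impl ColumnBounds {
    /// Intersect the current bounds with newly learned ones, keeping the
    /// tighter limit on each side.
    fn update(&self, new_min: i64, new_max: i64) -> Self {
        Self {
            min: self.min.max(new_min),
            max: self.max.min(new_max),
        }
    }
}

fn main() {
    // `A` is known to be in [1, 100]; after seeing `A = 50`, the rest of a
    // composite predicate such as `A = 50 AND B > A` can assume A is exactly 50.
    let a = ColumnBounds { min: 1, max: 100 };
    assert_eq!(a.update(50, 50), ColumnBounds { min: 50, max: 50 });
}
```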
   
   #### Filter Selectivity Analysis
   
   The selectivity analysis here is based on the uniform distribution assumption (since we currently don't have histograms, and they might never come in the short term) and uses the basic column bounds to match the ranges under this assumption.
   
   Considering that `$` is a uniformly distributed column (e.g. `list(range(1, 
100+1))`), the selectivity of each operation is the following:
   
   | $ = [1, 100] | min          | max            | selectivity (formula)            | selectivity (value) |
   |--------------|--------------|----------------|----------------------------------|---------------------|
   | $ = 75       | 75           | 75             | `1 / range($)`                   | 1%                  |
   | $ >= 75      | 75           | `max($) = 100` | `((max($) - 75) + 1) / range($)` | 26%                 |
   | $ > 75       | 76*          | `max($) = 100` | `(max($) - 75) / range($)`       | 25%                 |
   | $ <= 75      | `min($) = 1` | 75             | `((75 - min($)) + 1) / range($)` | 75%                 |
   | $ < 75       | `min($) = 1` | 74*            | `(75 - min($)) / range($)`       | 74%                 |
   
   This uses a basic approach based on intersecting set bounds (where we know one of the sets has a static bound), and with it we produce three different values: a new `min` (or the old one, depending on the operator), a new `max` (same), and a filter selectivity rate (the fraction of rows the given expression would select **if the column were uniformly distributed**). If you are interested in playing it out in different scenarios and seeing the real results, I have a small [script that can show it](https://gist.github.com/isidentical/f5910711c9e30e9ac4412cddf547e574).
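
   For reference, here is a minimal sketch of the `>=` case from the table above, under the same uniform distribution assumption (the function name and signature are made up for the example, not taken from the PR):

```rust
/// Estimate the new (min, max) bounds and the selectivity of `col >= literal`
/// for an integer column with inclusive bounds [col_min, col_max], assuming
/// the values are uniformly distributed.
fn estimate_gte(col_min: i64, col_max: i64, literal: i64) -> (i64, i64, f64) {
    // The literal becomes the new lower bound (clamped into the column range);
    // the upper bound is unchanged.
    let new_min = literal.max(col_min);
    let new_max = col_max;
    // Under the uniform assumption, selectivity is the fraction of the range
    // that survives the filter.
    let range = (col_max - col_min + 1) as f64;
    let selected = (new_max - new_min + 1).max(0) as f64;
    (new_min, new_max, selected / range)
}

fn main() {
    // Column uniformly distributed over [1, 100], predicate `$ >= 75`:
    // matches the 26% row in the table above.
    let (min, max, selectivity) = estimate_gte(1, 100, 75);
    assert_eq!((min, max), (75, 100));
    assert!((selectivity - 0.26).abs() < 1e-9);
}
```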
   
   
   #### Low Priority / Other (but still needed)
   
   - Moving `ColumnStatistics` and `Statistics` out of `core/physical_plan` (and into `datafusion-common`). They are still re-exported from `core/physical_plan` so as not to break the public API.
   
   - Implementing a general absolute distance method that works with numeric scalars: `scalar.distance(other)` returns `usize(|scalar - other|)`. This API is something I found myself needing quite a bit when dealing with statistics, but if it is a bit too much (since we already have `scalar.sub`), I could also be persuaded to make it a utility method somewhere else. A rough sketch of both this and `Operator.swap()` follows this list.
   
   - `Operator.swap()`, which returns the swapped version of the same operator, i.e. the one that preserves the comparison's meaning when the lhs/rhs are also swapped.
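
   Rough sketches of those last two helpers, with simplified types standing in for DataFusion's `ScalarValue` and `Operator` (illustrative only, not the PR's exact code):

```rust
/// Absolute distance between two numeric scalars, here simplified to `i64`
/// (the version described above works on scalar values instead).
fn distance(lhs: i64, rhs: i64) -> usize {
    lhs.abs_diff(rhs) as usize
}

/// A comparison operator that can be swapped so that `lhs OP rhs` keeps its
/// meaning when the operands are flipped, e.g. `a < b` becomes `b > a`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CmpOp {
    Eq,
    Lt,
    LtEq,
    Gt,
    GtEq,
}

impl CmpOp {
    fn swap(self) -> Self {
        match self {
            CmpOp::Eq => CmpOp::Eq,
            CmpOp::Lt => CmpOp::Gt,
            CmpOp::LtEq => CmpOp::GtEq,
            CmpOp::Gt => CmpOp::Lt,
            CmpOp::GtEq => CmpOp::LtEq,
        }
    }
}

fn main() {
    assert_eq!(distance(3, 10), 7);
    assert_eq!(CmpOp::Lt.swap(), CmpOp::Gt);
}
```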
   
   
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be 
updated before approving the PR.
   -->
   Quite a bit if you count the new APIs; if not, not much.
   
   <!--
   If there are any breaking changes to public APIs, please add the `api 
change` label.
   -->

