isidentical opened a new pull request, #3868:
URL: https://github.com/apache/arrow-datafusion/pull/3868
# Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases. You can
link an issue to this PR using the GitHub syntax. For example `Closes #123`
indicates that this PR will close issue #123.
-->
Part of #3845.
# Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly in
the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your
changes and offer better suggestions for fixes.
-->
DataFusion has some cost-based optimizer rules (that operate on the
physical plan) where statistics are leveraged to pick a new version of
that plan with a lower estimated cost. One great example is the side swapping on
hash joins, where we try to make the build side as small as possible. These rules
work reasonably well when there is no interference between the joins and the
table provider (since joins can now estimate their output cardinality), but one
common thing that can break this flow is a filter node, which currently does
not support producing statistics (as shown in @Dandandan 's example in #3845).
# What changes are included in this PR?
This PR **does not** implement the statistics propagation from filters, but
I can also include that. The main reasoning was that, since this part (the new
expression statistics API and the filter selectivity analysis) is very
self-contained, it could land separately and we could then build the filter
statistics propagation on top of it. If it makes sense to include it here as well, please
let me know (or, if this PR is too big by itself, I can also split it further,
depending on what is easier to review!)
#### Expression Statistics
Since we needed a way of propagating statistics at the expression level
(rather than the plan level), we now have a new API that is more suitable for
it. Instead of dealing with individual columns, this API deals with boundaries
of expression results. For the initial work, the API is implemented for the
following expressions:
- Literals, where the boundary's `min`/`max` are the value the literal holds
and `distinct_count` is `1`.
- Columns, where the boundary is the same as the boundary from the column
stats (which are passed from the plan into the expressions when using this
API).
- Comparisons (`=`, `>`, `>=`, `<`, `<=`) between one column and one literal
(or, more precisely, any expression whose boundary can be reduced to a single
scalar). Using the filter selectivity analysis (which is also implemented in
this PR), we estimate the new `min`/`max` values of the result and can reason
about the expression's lower and upper bounds (as well as its selectivity).
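To make the shape of this API concrete, here is a minimal sketch of the boundary information described above. All names and the integer-only fields are illustrative inventions for this sketch, not the actual DataFusion types:

```rust
/// Hypothetical, simplified stand-in for the boundary information the
/// expression statistics API carries (illustrative names and types only).
#[derive(Debug, Clone, PartialEq)]
pub struct ExprBoundaries {
    pub min: i64,
    pub max: i64,
    pub distinct_count: Option<u64>,
}

impl ExprBoundaries {
    /// A literal's boundary is the value itself, with exactly one distinct value.
    pub fn from_literal(value: i64) -> Self {
        ExprBoundaries { min: value, max: value, distinct_count: Some(1) }
    }

    /// A column's boundary is taken directly from its column statistics,
    /// which the plan passes down into the expressions.
    pub fn from_column_stats(min: i64, max: i64, distinct_count: Option<u64>) -> Self {
        ExprBoundaries { min, max, distinct_count }
    }
}
```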
There is also a new API for updating the statistics of a column with new,
known boundaries. This is currently not used anywhere (although it is implemented and
tested), but its primary purpose will be tightening the known maximums and
minimums once we have support for composite predicates (for `A = 50 AND B > A`, we can
then see that `A` is actually `50`, at least in that context).
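At its core, updating a column's statistics with new known boundaries amounts to intersecting the previously known range with the new one. A toy sketch of that idea (the function name and integer bounds are hypothetical, not the PR's API):

```rust
/// Intersect a column's known inclusive range [min_a, max_a] with a newly
/// derived range [min_b, max_b]. Returns None when the ranges do not
/// overlap, i.e. the predicate can never match.
fn intersect_bounds(min_a: i64, max_a: i64, min_b: i64, max_b: i64) -> Option<(i64, i64)> {
    let min = min_a.max(min_b);
    let max = max_a.min(max_b);
    if min <= max { Some((min, max)) } else { None }
}
```

For example, `A = 50` on a column known to be in `[1, 100]` intersects with `[50, 50]`, tightening `A`'s boundary to exactly `50`.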
#### Filter Selectivity Analysis
The selectivity analysis here is based on the uniform distribution
assumption (since we currently don't have histograms, and they are unlikely to
come in the short term) and uses the basic column bounds to match the ranges
under this assumption.
Considering that `$` is a uniformly distributed column (e.g. `list(range(1,
100+1))`), the selectivity of each operation is the following:
| $ = [1, 100] | min          | max            | selectivity (formula)            | selectivity |
|--------------|--------------|----------------|----------------------------------|-------------|
| $ = 75       | 75           | 75             | `1 / range($)`                   | 1%          |
| $ >= 75      | 75           | `max($) = 100` | `((max($) - 75) + 1) / range($)` | 26%         |
| $ > 75       | 76*          | `max($) = 100` | `(max($) - 75) / range($)`       | 25%         |
| $ <= 75      | `min($) = 1` | 75             | `((75 - min($)) + 1) / range($)` | 75%         |
| $ < 75       | `min($) = 1` | 74*            | `(75 - min($)) / range($)`       | 74%         |
This uses a basic approach based on set-bound intersections (where one of
the sets is known to have a static bound), and with it we produce three different
values: a new `min` (or the old one, depending on the operator), a new `max`
(same), and a filter selectivity rate (the fraction of rows the given expression
would select **if the column were uniformly distributed**). If you are interested in
playing it out in different scenarios and seeing the real results, I have a very
[little script that can show
it](https://gist.github.com/isidentical/f5910711c9e30e9ac4412cddf547e574).
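The formulas in the table can be sketched in plain Rust. This is a toy model under the uniform-distribution assumption; `CmpOp`, `selectivity`, and the integer-only inclusive bounds are inventions for this sketch, not DataFusion's API:

```rust
/// Comparison operators covered by the initial selectivity analysis.
enum CmpOp { Eq, Gt, GtEq, Lt, LtEq }

/// Selectivity of `column <op> value` under the uniform-distribution
/// assumption, for an integer column with inclusive bounds [min, max].
/// Mirrors the formulas in the table above, where range($) = max - min + 1.
fn selectivity(op: CmpOp, min: i64, max: i64, value: i64) -> f64 {
    let range = (max - min + 1) as f64;
    match op {
        CmpOp::Eq => 1.0 / range,
        CmpOp::GtEq => ((max - value + 1) as f64) / range,
        CmpOp::Gt => ((max - value) as f64) / range,
        CmpOp::LtEq => ((value - min + 1) as f64) / range,
        CmpOp::Lt => ((value - min) as f64) / range,
    }
}
```

Running it for the column `$ = [1, 100]` reproduces the table: `$ = 75` selects 1% of rows, `$ >= 75` selects 26%, and so on.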
#### Low Priority / Other (but still needed)
- Moving `ColumnStatistics` and `Statistics` out of `core/physical_plan`
(and into `datafusion-common`). They are still re-exported from
`core/physical_plan` so as not to break the public API.
- Implementing a general absolute distance method that works with numeric
scalars: `scalar.distance(other)` returns `usize(|scalar - other|)`. This
API is something I found was needed quite a bit when dealing with
statistics, but if it is a bit too much (since we already have `scalar.sub`), I
could also be persuaded to make it a utility method somewhere else.
- `Operator.swap()`, for returning the swapped version of the same operator
that produces an equivalent result when lhs/rhs are also swapped.
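The two helpers above can be illustrated with a small self-contained sketch. The enum, its variants, and the free-standing `distance` function are simplified stand-ins, not the actual DataFusion definitions:

```rust
/// Simplified comparison operators (stand-in for DataFusion's `Operator`).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Operator { Eq, Gt, GtEq, Lt, LtEq }

impl Operator {
    /// Return the operator that yields an equivalent expression when the
    /// operands are swapped: `a < b` is the same as `b > a`, etc.
    fn swap(self) -> Operator {
        match self {
            Operator::Eq => Operator::Eq,
            Operator::Gt => Operator::Lt,
            Operator::GtEq => Operator::LtEq,
            Operator::Lt => Operator::Gt,
            Operator::LtEq => Operator::GtEq,
        }
    }
}

/// Absolute distance between two numeric scalars, |a - b|, as usize.
/// `abs_diff` avoids the overflow a naive `(a - b).abs()` could hit.
fn distance(a: i64, b: i64) -> usize {
    a.abs_diff(b) as usize
}
```

`swap()` is what lets the analysis normalize `literal <op> column` into `column <swapped op> literal` before applying the boundary rules.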
# Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->
Quite a few, if you count the new APIs; otherwise, not much.
<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]