ch-sc opened a new issue, #14237:
URL: https://github.com/apache/datafusion/issues/14237
### Is your feature request related to a problem or challenge?
Today, statistics of filter predicates are based on interval arithmetic
invoked by `PhysicalExpr::evaluate_bounds()`. This works well for numerical
data. However, many expressions and data types are not supported by interval
arithmetic, so no proper selectivity estimate is available for such
expressions.
I noticed there have been lots of discussions regarding statistics in the
project lately. Work by folks from Synnada and others is currently in progress.
If you feel this issue is already addressed, please let me know; in that case
I'd like to offer help with open tasks instead.
### Describe the solution you'd like
1. Add support for data types still missing from interval arithmetic, e.g.,
temporal data.
2. Add `PhysicalExpr::evaluate_statistics()` to compute expression-level
statistics. This has already been proposed by others.
My suggestion is the following signature:
```rust
fn evaluate_statistics(
    &self,
    input_statistics: &Statistics,
) -> Result<ExpressionStatistics>
```
I think this should return a new expression-level statistics struct, which
could look like this:
```rust
pub struct ExpressionStatistics {
    /// Number of null values
    pub null_count: Precision<usize>,
    /// Number of output rows (cardinality)
    pub num_rows: Precision<ScalarValue>,
    /// Total number of input rows
    pub total_rows: Precision<ScalarValue>,
    /// Number of distinct values
    pub distinct_count: Precision<usize>,
}
```
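To make the idea concrete, here is a minimal, self-contained sketch of what an
implementation for a single expression type might look like. The `Precision`
and `ExpressionStatistics` types below are simplified stand-ins for the real
DataFusion types (e.g. row counts are plain `usize` here), and
`estimate_equality` is a hypothetical helper applying the classic 1/NDV
heuristic for an equality predicate `col = literal`:

```rust
/// Simplified stand-in for DataFusion's `Precision`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Simplified stand-in for the proposed `ExpressionStatistics`.
#[derive(Debug)]
struct ExpressionStatistics {
    null_count: Precision<usize>,
    num_rows: Precision<usize>,
    total_rows: Precision<usize>,
    distinct_count: Precision<usize>,
}

/// Hypothetical estimator for `col = literal`: assuming a uniform
/// distribution, each distinct non-null value matches ~1/NDV of the rows.
fn estimate_equality(total_rows: usize, distinct: usize, nulls: usize) -> ExpressionStatistics {
    let non_null = total_rows.saturating_sub(nulls);
    let matched = if distinct == 0 { 0 } else { non_null / distinct };
    ExpressionStatistics {
        // Equality with a literal never passes NULLs through.
        null_count: Precision::Exact(0),
        num_rows: Precision::Inexact(matched),
        total_rows: Precision::Exact(total_rows),
        // Only the single matched value survives the filter.
        distinct_count: Precision::Inexact(1),
    }
}

fn main() {
    // 1000 rows, 100 distinct values, 50 NULLs -> ~9 matching rows.
    let stats = estimate_equality(1_000, 100, 50);
    println!("{stats:?}");
}
```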
With `evaluate_statistics()` we add support for filter expressions such as
string comparisons, `InList`, `LikeExpr`, or binary operators like
`IS_DISTINCT_FROM` and `IS_NOT_DISTINCT_FROM`. This could be an iterative
approach where we start with a few expression types and take it from there.
Selectivity calculation is then trivial: `num_rows / total_rows`.
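As a sketch, the selectivity ratio is just (the zero-row guard below is my
own assumption, not something prescribed by the proposal):

```rust
/// Selectivity of a predicate: the fraction of input rows that pass it.
fn selectivity(num_rows: usize, total_rows: usize) -> f64 {
    if total_rows == 0 {
        // Hypothetical choice: treat an empty input as fully selective.
        1.0
    } else {
        num_rows as f64 / total_rows as f64
    }
}

fn main() {
    // 250 of 1000 rows pass the filter -> selectivity 0.25.
    println!("{}", selectivity(250, 1_000));
}
```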
We can reuse `evaluate_bounds()` for supported expressions. For example, for
`2*A > B` we compute its target boundaries and calculate the selectivity as is
done in `analysis::calculate_selectivity()`:
```rust
fn calculate_selectivity(
    target_boundaries: &[ExprBoundaries],
    initial_boundaries: &[ExprBoundaries],
) -> f64 {
    // Since the intervals are assumed uniform and the values
    // are not correlated, we need to multiply the selectivities
    // of multiple columns to get the overall selectivity.
    initial_boundaries
        .iter()
        .zip(target_boundaries.iter())
        .fold(1.0, |acc, (initial, target)| {
            acc * cardinality_ratio(&initial.interval, &target.interval)
        })
}
```
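To illustrate the per-column ratio that feeds this fold, here is a simplified
numeric-only stand-in for `cardinality_ratio` (the real implementation works
on `Interval` values and handles typed and unbounded endpoints; this version
is an assumption-laden sketch over closed `f64` ranges):

```rust
/// Simplified stand-in for `cardinality_ratio`: the fraction of the
/// initial (min, max) range that the narrowed target range covers,
/// assuming uniformly distributed values.
fn cardinality_ratio(initial: (f64, f64), target: (f64, f64)) -> f64 {
    let initial_width = initial.1 - initial.0;
    let target_width = target.1 - target.0;
    if initial_width <= 0.0 {
        // Degenerate input range: assume everything is selected.
        1.0
    } else {
        (target_width / initial_width).clamp(0.0, 1.0)
    }
}

fn main() {
    // E.g. a predicate narrows A from [0, 100] down to [25, 100]:
    // 75% of the range survives, so the estimated selectivity is 0.75.
    let sel = cardinality_ratio((0.0, 100.0), (25.0, 100.0));
    println!("selectivity = {sel}");
}
```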
This naive approach assumes uniformly distributed data. Heuristics, such as
various distribution types, could be added to `ExpressionStatistics` too. For
the sake of simplicity I will not address this here.
Happy to receive some feedback 🙂
### Describe alternatives you've considered
_No response_
### Additional context
Short disclaimer: like some other DataFusion contributors, I work for
Coralogix.
cc: @thinkharderdev
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]