devinjdangelo commented on issue #8699:
URL:
https://github.com/apache/arrow-datafusion/issues/8699#issuecomment-1873355150
I took a look at the `Statistics` that a `TableProvider` can currently
return, as @andygrove suggested on Slack. The interface feels like it was
designed specifically around (and coupled to) Parquet, since the only
column-level stats that can be returned are:
```rust
/// Statistics for a column within a relation
#[derive(Clone, Debug, PartialEq, Eq, Default)]
pub struct ColumnStatistics {
    /// Number of null values on column
    pub null_count: Precision<usize>,
    /// Maximum value of column
    pub max_value: Precision<ScalarValue>,
    /// Minimum value of column
    pub min_value: Precision<ScalarValue>,
    /// Number of distinct values
    pub distinct_count: Precision<usize>,
}
```
If you think about a `TableProvider` backed by a fully fledged execution
engine, we could push down much more than this. For example, we could compute
the mean of a column with the pushed-down filters applied at the same time
(provided the filters are all "exact", which they likely are in this case). I
also think calling the feature "Statistics" is confusing outside of the
Parquet context. Parquet statistics are precomputed and stored in the file,
but other `TableProvider`s could calculate arbitrary "statistics" on the fly,
which I think would more commonly be called "aggregations" in this context.
Perhaps we could have a more general "AggregationPushdown" feature, and
"Statistics" could be a special-case implementation for Parquet-backed tables
that supports pushdown of some aggregations.
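
A rough sketch of how such a feature could be shaped, to make the idea concrete. Everything here is hypothetical and not part of DataFusion: the `AggregationPushdown` trait, the `Aggregate` and `GreaterThan` types, and the two provider structs are invented for illustration, and `Precision` is a local stand-in that only mirrors the shape of DataFusion's type so the example compiles on its own.

```rust
/// Local stand-in mirroring the shape of DataFusion's `Precision` (assumption:
/// defined here only so the sketch has no external dependencies).
#[derive(Clone, Debug, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical: aggregates a provider might be asked to compute.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Aggregate {
    Min,
    Max,
    Mean,
}

/// Hypothetical: a single pushed-down predicate, `column > threshold`.
#[derive(Clone, Copy, Debug)]
pub struct GreaterThan(pub f64);

/// Hypothetical trait: answer an aggregate over one column, optionally with a
/// filter applied, or return `Precision::Absent` if the provider cannot.
pub trait AggregationPushdown {
    fn aggregate(&self, agg: Aggregate, filter: Option<GreaterThan>) -> Precision<f64>;
}

/// Provider backed by something that can scan its data (here just a Vec).
pub struct EngineBacked {
    pub values: Vec<f64>,
}

impl AggregationPushdown for EngineBacked {
    fn aggregate(&self, agg: Aggregate, filter: Option<GreaterThan>) -> Precision<f64> {
        // Apply the pushed-down filter first, then aggregate the surviving rows.
        let rows: Vec<f64> = self
            .values
            .iter()
            .copied()
            .filter(|v| filter.map_or(true, |f| *v > f.0))
            .collect();
        if rows.is_empty() {
            return Precision::Absent;
        }
        let value = match agg {
            Aggregate::Min => rows.iter().copied().fold(f64::INFINITY, f64::min),
            Aggregate::Max => rows.iter().copied().fold(f64::NEG_INFINITY, f64::max),
            Aggregate::Mean => rows.iter().sum::<f64>() / rows.len() as f64,
        };
        Precision::Exact(value)
    }
}

/// Provider that only has precomputed min/max, as with Parquet file statistics.
pub struct StatsBacked {
    pub min: f64,
    pub max: f64,
}

impl AggregationPushdown for StatsBacked {
    fn aggregate(&self, agg: Aggregate, filter: Option<GreaterThan>) -> Precision<f64> {
        // Precomputed stats can only answer unfiltered min/max; anything else
        // has to fall back to a regular scan.
        match (agg, filter) {
            (Aggregate::Min, None) => Precision::Exact(self.min),
            (Aggregate::Max, None) => Precision::Exact(self.max),
            _ => Precision::Absent,
        }
    }
}
```

In this framing the statistics-backed provider is just the implementation that answers a small subset of requests, while an engine-backed provider can answer a filtered mean directly. A real design would need to handle multiple columns, arbitrary expressions, and non-float types, all of which this sketch leaves out.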