devinjdangelo commented on issue #8699:
URL: 
https://github.com/apache/arrow-datafusion/issues/8699#issuecomment-1873355150

   I took a look at the `Statistics` that a `TableProvider` can currently 
return, as @andygrove suggested on Slack. The interface seems to have been 
designed around Parquet specifically, since the only column-level stats that 
can be returned are:
   
   ```rust
   /// Statistics for a column within a relation
   #[derive(Clone, Debug, PartialEq, Eq, Default)]
   pub struct ColumnStatistics {
       /// Number of null values on column
       pub null_count: Precision<usize>,
       /// Maximum value of column
       pub max_value: Precision<ScalarValue>,
       /// Minimum value of column
       pub min_value: Precision<ScalarValue>,
       /// Number of distinct values
       pub distinct_count: Precision<usize>,
   }
   ```
   
   If you think about a `TableProvider` backed by a fully fledged execution 
engine, we could push down much more than this. E.g. we could compute the mean 
of a column while applying the pushed-down filters at the same time (if the 
filters are all "exact", which they likely are in this case). I also think 
calling the feature "Statistics" is confusing outside of the Parquet context. 
Parquet statistics are precomputed and stored in the file, but other 
`TableProvider`s could calculate arbitrary "statistics" on the fly, which I 
think would more commonly be called "aggregations" in this context. 
   
   Perhaps we could have a more general "AggregationPushdown" feature, and 
"Statistics" could be a special-case implementation for Parquet-backed tables 
that supports pushdown of some aggregations. 
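   
   To make the idea concrete, here is a rough sketch of what I mean. None of 
this exists in DataFusion: the `AggregatePushdown` trait, the `Aggregate` enum, 
the `GreaterThan` filter and both table types are hypothetical names, and 
`Precision` is a local stand-in mirroring the `Exact`/`Inexact`/`Absent` 
variants so the snippet compiles on its own. A real design would need to work 
with `ScalarValue`, `Expr` filters and the existing `TableProvider` trait.

```rust
/// Local stand-in mirroring the variants of DataFusion's `Precision`.
#[derive(Clone, Debug, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical: aggregates a planner might ask a table to answer.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Aggregate {
    Min,
    Max,
    Mean,
    NullCount,
}

/// Hypothetical: a single pushed-down filter of the form `value > threshold`.
#[derive(Clone, Copy, Debug)]
pub struct GreaterThan(pub f64);

/// Hypothetical trait; not part of DataFusion's `TableProvider`.
pub trait AggregatePushdown {
    /// Returns `Absent` when the table cannot answer the request.
    fn aggregate(&self, agg: Aggregate, filter: Option<GreaterThan>) -> Precision<f64>;
}

/// Parquet-like table: only precomputed, unfiltered min/max/null count.
pub struct PrecomputedStats {
    pub min: f64,
    pub max: f64,
    pub null_count: usize,
}

impl AggregatePushdown for PrecomputedStats {
    fn aggregate(&self, agg: Aggregate, filter: Option<GreaterThan>) -> Precision<f64> {
        // In this sketch, stored stats are not used for a filtered subset.
        if filter.is_some() {
            return Precision::Absent;
        }
        match agg {
            Aggregate::Min => Precision::Exact(self.min),
            Aggregate::Max => Precision::Exact(self.max),
            Aggregate::NullCount => Precision::Exact(self.null_count as f64),
            Aggregate::Mean => Precision::Absent, // not stored in the file
        }
    }
}

/// Engine-backed table: computes any supported aggregate on the fly.
pub struct EngineBacked {
    pub values: Vec<Option<f64>>,
}

impl AggregatePushdown for EngineBacked {
    fn aggregate(&self, agg: Aggregate, filter: Option<GreaterThan>) -> Precision<f64> {
        // Apply the filter first; nulls never pass a comparison filter.
        let passes = |v: &Option<f64>| match (v, filter) {
            (_, None) => true,
            (Some(x), Some(GreaterThan(t))) => *x > t,
            (None, Some(_)) => false,
        };
        let rows: Vec<Option<f64>> =
            self.values.iter().copied().filter(|v| passes(v)).collect();
        let non_null: Vec<f64> = rows.iter().flatten().copied().collect();
        match agg {
            Aggregate::NullCount => Precision::Exact((rows.len() - non_null.len()) as f64),
            // Min/max/mean are undefined when no non-null rows remain.
            _ if non_null.is_empty() => Precision::Absent,
            Aggregate::Min => {
                Precision::Exact(non_null.iter().copied().fold(f64::INFINITY, f64::min))
            }
            Aggregate::Max => {
                Precision::Exact(non_null.iter().copied().fold(f64::NEG_INFINITY, f64::max))
            }
            Aggregate::Mean => {
                Precision::Exact(non_null.iter().sum::<f64>() / non_null.len() as f64)
            }
        }
    }
}
```

   With this shape, the Parquet-backed case is just one implementation that 
answers a small fixed set of requests, while an engine-backed table can answer 
a filtered mean exactly. The planner would fall back to scanning whenever it 
gets `Absent`.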


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
