adriangb commented on issue #19487: URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3695622895
> Now, I plan to build this statistics propagation feature with the existing `PruningStatistics`
>
> [datafusion/datafusion/common/src/pruning.rs](https://github.com/apache/datafusion/blob/6ce237492d9f75477c594ba132b2575932122dd6/datafusion/common/src/pruning.rs#L63)
>
> Line 63 in [6ce2374](/apache/datafusion/commit/6ce237492d9f75477c594ba132b2575932122dd6)
>
> `pub trait PruningStatistics {`
>
> Is it the case this trait can't populate stats for individual struct columns? Is it possible to extend `PruningStatistics` for it? And the later propagation steps should be able to handle them.

It can’t populate / represent stats for nested fields. Struct columns don’t in and of themselves have statistics in Parquet, only their leaf fields do (except for maybe null counts? Not sure about that…). And `PruningStatistics` (and the `Statistics` struct that carries the information) can only address top-level columns.

I think your proposal helps the situation but doesn’t totally resolve it. Functions like `get_field` could implement `propagate_range_stats` by extracting the field's stats into top-level column stats, but we still need a way to represent that a single top-level column may be composed of sub-structures and carry around the stats for all of that.
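To make the gap concrete, here is a rough, self-contained sketch of what "carrying around the stats for all of that" might look like. None of these types exist in DataFusion: `LeafStats`, `NestedStats`, `project_field_stats`, and `example_struct_stats` are hypothetical names made up for illustration, and the `i64` bounds stand in for whatever scalar representation would really be used. The idea is only that a top-level column's stats form a tree, and a `get_field`-like function projects one child out of that tree.

```rust
use std::collections::BTreeMap;

/// Hypothetical min/max/null-count summary for a single leaf column.
#[derive(Debug, Clone, PartialEq)]
pub struct LeafStats {
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: Option<u64>,
}

/// Hypothetical nested statistics: a top-level column is either a leaf,
/// or a struct whose children each carry their own statistics.
#[derive(Debug, Clone, PartialEq)]
pub enum NestedStats {
    Leaf(LeafStats),
    Struct(BTreeMap<String, NestedStats>),
}

/// What a `get_field`-like function could do during propagation:
/// return the named child's stats so they describe the output column.
/// Returns `None` if the input is not a struct or the field is unknown.
pub fn project_field_stats(input: &NestedStats, field: &str) -> Option<NestedStats> {
    match input {
        NestedStats::Struct(children) => children.get(field).cloned(),
        NestedStats::Leaf(_) => None,
    }
}

/// Example stats for an imaginary struct column with leaf children `a` and `b`.
pub fn example_struct_stats() -> NestedStats {
    let a = LeafStats { min: Some(1), max: Some(10), null_count: Some(0) };
    // `b` has unknown bounds; only its null count is known.
    let b = LeafStats { min: None, max: None, null_count: Some(3) };
    NestedStats::Struct(BTreeMap::from([
        ("a".to_string(), NestedStats::Leaf(a)),
        ("b".to_string(), NestedStats::Leaf(b)),
    ]))
}
```

With something shaped like this, `project_field_stats(&example_struct_stats(), "a")` yields the leaf stats for `a`, which could then be treated like any other top-level column's stats by later propagation steps. The open question from the comment remains: where such a tree would live, given that the existing structures are keyed by top-level column only.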
