adriangb commented on issue #19487:
URL: https://github.com/apache/datafusion/issues/19487#issuecomment-3695622895

   
   > Now, I plan to build this statistics propagation feature with the existing 
`PruningStatistics`
   > 
   > 
[datafusion/datafusion/common/src/pruning.rs](https://github.com/apache/datafusion/blob/6ce237492d9f75477c594ba132b2575932122dd6/datafusion/common/src/pruning.rs#L63)
   > 
   > Line 63 in 
[6ce2374](/apache/datafusion/commit/6ce237492d9f75477c594ba132b2575932122dd6)
   > 
   >  pub trait PruningStatistics { 
   > Is it the case this trait can't populate stats for individual struct 
columns? Is it possible to extend `PruningStatistics` for it? And the later 
propagation steps should be able to handle them.
   
   It can’t populate or represent stats for nested fields. In Parquet, struct 
columns don’t have statistics of their own; only their leaf fields do 
(except maybe null counts? Not sure about that…). And `PruningStatistics` 
(and the `Statistics` struct that carries the information) can only 
address top-level columns. I think your proposal helps the situation but 
doesn’t totally resolve it. Functions like `get_field` could implement 
`propagate_range_stats` by extracting the field's stats into top-level column 
stats, but we still need a way to represent that a single top-level column may 
be composed of sub-structures, and a way to carry around the stats for all of 
them.
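
   To make the missing piece concrete, here is a minimal standalone sketch 
of a statistics tree in which a top-level column is either a leaf or a struct 
of children. `RangeStats`, `ColumnStats` and `get_field_stats` are hypothetical 
names invented for illustration; none of them are DataFusion APIs, and the 
`i64`-only min/max is a simplifying assumption.

```rust
use std::collections::HashMap;

/// Hypothetical min/max/null-count summary for one leaf column.
/// (Illustration only; real stats would not be limited to i64.)
#[derive(Debug, Clone, PartialEq)]
struct RangeStats {
    min: Option<i64>,
    max: Option<i64>,
    null_count: Option<u64>,
}

/// Hypothetical statistics tree: a top-level column is either a leaf with
/// its own range, or a struct whose stats live on its child fields.
#[derive(Debug, Clone, PartialEq)]
enum ColumnStats {
    Leaf(RangeStats),
    Struct(HashMap<String, ColumnStats>),
}

/// Sketch of what a `get_field`-style function could do when propagating
/// stats: look up the named child and return its subtree, which the caller
/// can then treat as the stats of a top-level column.
/// Returns `None` if the parent is a leaf or the field is unknown.
fn get_field_stats<'a>(parent: &'a ColumnStats, field: &str) -> Option<&'a ColumnStats> {
    match parent {
        ColumnStats::Struct(children) => children.get(field),
        ColumnStats::Leaf(_) => None,
    }
}
```

Because the result is itself a `ColumnStats`, the same lookup could be 
chained for deeper nesting, which is the part a flat, top-level-only 
representation cannot express.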


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


