adriangb commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3259100142

   Reading through the issues and posting my thoughts as I go. I am 
particularly interested in improving the `Statistics` that gets attached to 
files and partitions:
   
   
https://github.com/pydantic/datafusion/blob/e6c2b754c1d59522314259658f272b412ee40589/datafusion/common/src/stats.rs#L270-L280
   
   It seems this struct just hasn't been updated to use `Distribution` instead of 
`Precision`. Doing so requires a redesign of the `Statistics` struct and 
handling all of the resulting breaking changes. v50 already has a lot of 
breaking changes, so I don't think we should try to fit this into that release, 
but maybe v51. I also have some ideas for other changes. Namely: instead of 
requiring a `ColumnStatistics` entry for every column, even those with no 
statistics present, we could include entries only for the columns that have 
them. Otherwise wide tables require a lot of memory: a per-column structure is 
fine for `Schema`, but this one exists once per file.
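
   To make the memory argument concrete, here is a rough sketch of the sparse 
layout idea. Everything in it is hypothetical: `Precision`, `ColumnStats`, 
`DenseStats` and `SparseStats` are simplified stand-ins written for this 
comment, not the actual DataFusion types, and the real `ColumnStatistics` 
carries more fields than shown here.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a precision wrapper (hypothetical, not the real type).
#[derive(Debug, Clone, PartialEq)]
pub enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Simplified per-column statistics (hypothetical; the real struct has more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub null_count: Precision<usize>,
    pub min: Precision<i64>,
    pub max: Precision<i64>,
}

impl ColumnStats {
    /// Statistics for a column we know nothing about.
    pub fn unknown() -> Self {
        Self {
            null_count: Precision::Absent,
            min: Precision::Absent,
            max: Precision::Absent,
        }
    }

    /// True when every field is `Absent`.
    pub fn is_unknown(&self) -> bool {
        *self == Self::unknown()
    }
}

/// Dense layout: one entry per schema column, even when nothing is known.
pub struct DenseStats {
    pub num_rows: Precision<usize>,
    pub columns: Vec<ColumnStats>,
}

/// Sparse layout: entries only for columns that carry some statistic,
/// keyed by the column's index in the schema.
pub struct SparseStats {
    pub num_rows: Precision<usize>,
    pub columns: HashMap<usize, ColumnStats>,
}

impl SparseStats {
    /// Keep only the columns with at least one known statistic.
    pub fn from_dense(dense: &DenseStats) -> Self {
        let columns = dense
            .columns
            .iter()
            .enumerate()
            .filter(|(_, c)| !c.is_unknown())
            .map(|(i, c)| (i, c.clone()))
            .collect();
        Self {
            num_rows: dense.num_rows.clone(),
            columns,
        }
    }

    /// Look up a column, falling back to "unknown" when no entry is stored.
    pub fn column(&self, idx: usize) -> ColumnStats {
        self.columns
            .get(&idx)
            .cloned()
            .unwrap_or_else(ColumnStats::unknown)
    }
}
```

   With a layout like this, a file in a 1000-column table that only has 
statistics for one column would store one entry instead of 1000, and callers 
would still get an "unknown" value for every other column. The map keyed by 
column index is just one possible choice for the sketch.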
   
   @alamb @ozankabak let me know if that sounds correct


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

