adriangb commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3259100142
Reading through the issues and posting my thoughts as I go.

I am particularly interested in improving the `Statistics` that get attached to files and partitions: https://github.com/pydantic/datafusion/blob/e6c2b754c1d59522314259658f272b412ee40589/datafusion/common/src/stats.rs#L270-L280

It seems that this struct just hasn't been updated to use `Distribution` instead of `Precision`. Doing so requires a redesign of the `Statistics` struct and handling all of the resulting breaking changes. v50 already has a lot of breaking changes, so I don't think we should try to fit this into that release, but maybe v51.

I have some ideas for other changes as well. Namely: instead of requiring a `ColumnStatistics` for every column, even those whose statistics are not present, we could include them only for the columns that actually have some. Otherwise a lot of memory is required for wide tables. That per-column cost is fine for `Schema`, but this structure exists once per file.

@alamb @ozankabak let me know if that sounds correct.
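To make the memory argument concrete, here is a minimal sketch of the "only store statistics for columns that have them" idea. Everything in it is an assumption for illustration: `ColStats`, `DenseStats`, and `SparseStats` are hypothetical stand-ins, not DataFusion's `Statistics` or `ColumnStatistics`, and plain `Option` fields are used in place of `Precision` or `Distribution`. Keying by column index in a `HashMap` is just one possible layout.

```rust
use std::collections::HashMap;

/// Hypothetical per-column statistics (NOT DataFusion's `ColumnStatistics`).
/// `None` means "unknown".
#[derive(Debug, Clone, PartialEq, Default)]
pub struct ColStats {
    pub null_count: Option<usize>,
    pub min: Option<i64>,
    pub max: Option<i64>,
}

/// Dense layout: one entry per schema column, whether or not anything is known.
pub struct DenseStats {
    pub num_rows: Option<usize>,
    pub columns: Vec<ColStats>,
}

/// Sparse layout (assumed design): entries only for columns with some known
/// statistic, keyed by the column's index in the schema.
pub struct SparseStats {
    pub num_rows: Option<usize>,
    pub columns: HashMap<usize, ColStats>,
}

impl SparseStats {
    /// Keep only the columns whose statistics differ from "all unknown".
    pub fn from_dense(dense: &DenseStats) -> Self {
        let unknown = ColStats::default();
        let columns = dense
            .columns
            .iter()
            .enumerate()
            .filter(|(_, stats)| **stats != unknown)
            .map(|(idx, stats)| (idx, stats.clone()))
            .collect();
        SparseStats {
            num_rows: dense.num_rows,
            columns,
        }
    }

    /// Look up a column; a missing entry reads back as "all unknown".
    pub fn column(&self, idx: usize) -> ColStats {
        self.columns.get(&idx).cloned().unwrap_or_default()
    }
}
```

With a 1000-column schema where only one column has known statistics, `SparseStats::from_dense` keeps a single entry, while lookups for the remaining columns still return an "unknown" value. The trade-off is a hash lookup per access instead of direct indexing, which may or may not be acceptable for the planner's access patterns.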