alamb commented on issue #18953: URL: https://github.com/apache/datafusion/issues/18953#issuecomment-3598406492
Thank you @nuno-faria 🙏 > The FileStatisticsCache will need to be a trait with a list_entries type function to get the cached entries, just like FileMetadataCache. Since FileMetadataCache is public, it will cause breaking changes, but it shouldn't be difficult to migrate. I agree we'll need a real trait and your proposal makes sense to me > In your example there is a statistics_size_bytes, but I don't think that information is directly available in Statistics, unlike with the cache metadata entries. However, looking at the Statistics struct it appears to only store two Precision<usize> + five Precision<usize | ScalarValue> per column stored, which should be relatively small. I think this can get potentially quite large for wide tables or with large numbers of columns. For example, a [`ScalarValue` is 64 bytes each](https://github.com/apache/datafusion/blob/838e1dea832e3cd8585498ba12216e1ad9f584a4/datafusion/expr/src/expr.rs#L3942-L3941), so that means a 20 column table with 10 row groups each takes 6KB (without accounting for the size of Vecs, etc): 16 + (20 * 5 * 64) I recommend: 1. For the first PR, report a memory size of -1 or something to show it isn't actually accurate, and limit the size of the cache by number of `Statistics` objects 2. As a follow on PR, we can add a way to calculate the memory usage and also limit the memory usage by size -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
