alamb commented on issue #19052: URL: https://github.com/apache/datafusion/issues/19052#issuecomment-3603796097
My suggestion would be to implement @nuno-faria's ideas here: https://github.com/apache/datafusion/issues/18953#issuecomment-3598406492

Then, once we have a real trait, add the ability to track the memory used by `Statistics` and use that measurement to implement a size-limited cache. This would be similar in design to @BlakeOrth's PR here: https://github.com/apache/datafusion/pull/18855

I suggest a sequence of smaller PRs:

1. Add a method to statistics, like `Statistics::heap_size`, that reports the heap (allocation) size for statistics, following the semantics of https://github.com/apache/arrow-rs/blob/b93fa52e47a83dff130709a6e80a4be5017dbf09/parquet/src/file/metadata/memory.rs#L35
2. Update the `DefaultFileStatisticsCache` to have a memory limit and use the `heap_size` from step 1 to constrain its size
3. Add a config parameter, perhaps `datafusion.runtime.statistics_cache_limit`, similar to [`datafusion.runtime.metadata_cache_limit`](https://datafusion.apache.org/user-guide/configs.html#runtime-configuration-settings), and set the default statistics cache in https://github.com/apache/datafusion/blob/71fcd03d614c466fdc8820084dc2896f69775f2b/datafusion/execution/src/cache/cache_manager.rs#L140-L141 (following the example of `DefaultFilesMetadataCache`)
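
To make steps 1 and 2 concrete, here is a rough, standalone sketch of how they could fit together. It does not use DataFusion's real types: `HeapSize`, `ColumnStats`, `FileStats`, and `LimitedStatsCache` are hypothetical stand-ins, and the FIFO eviction is an assumption chosen for brevity (the real cache might prefer LRU). The one idea it borrows from the proposal is that `heap_size` counts only heap allocations owned by a value, not the size of the value itself.

```rust
use std::collections::{HashMap, VecDeque};
use std::mem::size_of;

/// Hypothetical trait: bytes of heap memory owned by a value,
/// excluding `size_of::<Self>()` (the inline part).
trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

impl<T: HeapSize> HeapSize for Option<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().map_or(0, |v| v.heap_size())
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // The vector's own buffer plus whatever each element owns on the heap.
        self.capacity() * size_of::<T>()
            + self.iter().map(|v| v.heap_size()).sum::<usize>()
    }
}

/// Simplified stand-in for per-column statistics (not DataFusion's type).
#[allow(dead_code)]
struct ColumnStats {
    min: Option<String>,
    max: Option<String>,
    null_count: usize,
}

impl HeapSize for ColumnStats {
    fn heap_size(&self) -> usize {
        self.min.heap_size() + self.max.heap_size()
    }
}

/// Simplified stand-in for per-file statistics (not DataFusion's type).
#[allow(dead_code)]
struct FileStats {
    num_rows: usize,
    columns: Vec<ColumnStats>,
}

impl HeapSize for FileStats {
    fn heap_size(&self) -> usize {
        self.columns.heap_size()
    }
}

/// Hypothetical cache that keeps the total heap size of its values under
/// `limit` bytes, evicting the oldest inserted entries first (FIFO).
struct LimitedStatsCache {
    limit: usize,
    used: usize,
    entries: HashMap<String, FileStats>,
    order: VecDeque<String>, // insertion order, oldest at the front
}

#[allow(dead_code)]
impl LimitedStatsCache {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Returns false (and caches nothing) if `stats` alone exceeds the limit.
    fn put(&mut self, path: &str, stats: FileStats) -> bool {
        let size = stats.heap_size();
        // Drop any previous entry for this path so it is not counted twice.
        if let Some(old) = self.entries.remove(path) {
            self.used -= old.heap_size();
            self.order.retain(|p| p.as_str() != path);
        }
        if size > self.limit {
            return false;
        }
        // Evict oldest entries until the new one fits.
        while self.used + size > self.limit {
            match self.order.pop_front() {
                Some(oldest) => {
                    if let Some(evicted) = self.entries.remove(&oldest) {
                        self.used -= evicted.heap_size();
                    }
                }
                None => break,
            }
        }
        self.entries.insert(path.to_string(), stats);
        self.order.push_back(path.to_string());
        self.used += size;
        true
    }

    fn get(&self, path: &str) -> Option<&FileStats> {
        self.entries.get(path)
    }

    fn memory_used(&self) -> usize {
        self.used
    }
}
```

With a limit sized for two entries, inserting a third evicts the first, and an entry larger than the whole limit is rejected rather than cached. For step 3, the value passed to `LimitedStatsCache::new` would come from the proposed config parameter.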
