alamb commented on issue #19052:
URL: https://github.com/apache/datafusion/issues/19052#issuecomment-3603796097

   My suggestion would be to implement @nuno-faria 's ideas here: 
https://github.com/apache/datafusion/issues/18953#issuecomment-3598406492
   
   Then, once we have a real trait, add the ability to track the memory used by 
`Statistics` and use that measurement to implement a size-limited cache.
   
   This would be similar in design to @BlakeOrth 's PR here
   - https://github.com/apache/datafusion/pull/18855
   
   
   I suggest a sequence of smaller PRs:
   1. Add a method to statistics like `Statistics::heap_size` that reports the 
heap (allocation) size for statistics -- following the semantics of 
https://github.com/apache/arrow-rs/blob/b93fa52e47a83dff130709a6e80a4be5017dbf09/parquet/src/file/metadata/memory.rs#L35
   2. Update the `DefaultFileStatisticsCache` to have a memory limit and use 
the `heap_size` from step 1 to constrain its size
   3. Add a config parameter, perhaps 
`datafusion.runtime.statistics_cache_limit`, similar to 
[`datafusion.runtime.metadata_cache_limit`](https://datafusion.apache.org/user-guide/configs.html#runtime-configuration-settings),
 and set the default statistics cache in 
https://github.com/apache/datafusion/blob/71fcd03d614c466fdc8820084dc2896f69775f2b/datafusion/execution/src/cache/cache_manager.rs#L140-L141
 (following the example of `DefaultFilesMetadataCache`)
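
As a rough illustration of steps 1 and 2, here is a minimal, self-contained sketch. Everything in it is hypothetical and simplified: `HeapSize`, `ColumnStats`, `FileStats`, and `LimitedStatsCache` are stand-ins, not the real DataFusion or arrow-rs types, and the oldest-first eviction is only an assumption (the actual cache may well prefer LRU). It follows the convention that `heap_size` counts only heap allocations owned by a value, not `size_of::<Self>()`.

```rust
use std::collections::{HashMap, VecDeque};
use std::mem::size_of;

/// Hypothetical trait: bytes owned on the heap, excluding `size_of::<Self>()`.
trait HeapSize {
    fn heap_size(&self) -> usize;
}

/// Simplified stand-in for per-column statistics (not the DataFusion type).
#[allow(dead_code)]
struct ColumnStats {
    min: Option<String>,
    max: Option<String>,
    null_count: Option<usize>,
}

/// Simplified stand-in for `Statistics`.
#[allow(dead_code)]
struct FileStats {
    num_rows: Option<usize>,
    columns: Vec<ColumnStats>,
}

impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

impl<T: HeapSize> HeapSize for Option<T> {
    fn heap_size(&self) -> usize {
        self.as_ref().map_or(0, |v| v.heap_size())
    }
}

impl HeapSize for ColumnStats {
    fn heap_size(&self) -> usize {
        self.min.heap_size() + self.max.heap_size()
    }
}

impl HeapSize for FileStats {
    fn heap_size(&self) -> usize {
        // The Vec buffer stores the elements inline; add what each element owns.
        self.columns.capacity() * size_of::<ColumnStats>()
            + self.columns.iter().map(|c| c.heap_size()).sum::<usize>()
    }
}

/// Hypothetical cache that evicts the oldest entries once a byte budget is exceeded.
struct LimitedStatsCache {
    limit: usize,
    used: usize,
    entries: HashMap<String, FileStats>,
    order: VecDeque<String>, // insertion order, oldest first
}

impl LimitedStatsCache {
    fn new(limit: usize) -> Self {
        Self { limit, used: 0, entries: HashMap::new(), order: VecDeque::new() }
    }

    /// Approximate bytes charged for one entry: key + inline struct + heap data.
    fn entry_size(path: &str, stats: &FileStats) -> usize {
        path.len() + size_of::<FileStats>() + stats.heap_size()
    }

    fn remove(&mut self, path: &str) {
        if let Some(old) = self.entries.remove(path) {
            self.used -= Self::entry_size(path, &old);
            self.order.retain(|p| p.as_str() != path);
        }
    }

    /// Returns false if the entry alone is larger than the whole budget.
    fn put(&mut self, path: &str, stats: FileStats) -> bool {
        self.remove(path); // drop any stale entry for this path first
        let size = Self::entry_size(path, &stats);
        if size > self.limit {
            return false;
        }
        while self.used + size > self.limit {
            let Some(oldest) = self.order.pop_front() else { break };
            if let Some(old) = self.entries.remove(&oldest) {
                self.used -= Self::entry_size(&oldest, &old);
            }
        }
        self.used += size;
        self.order.push_back(path.to_string());
        self.entries.insert(path.to_string(), stats);
        true
    }

    fn get(&self, path: &str) -> Option<&FileStats> {
        self.entries.get(path)
    }
}
```

With a budget of roughly two entries, a third `put` would evict the first one. The limit passed to `new` is where a setting such as the proposed `datafusion.runtime.statistics_cache_limit` could plug in.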
   
    
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


