alamb commented on issue #18953:
URL: https://github.com/apache/datafusion/issues/18953#issuecomment-3598406492

   Thank you @nuno-faria  🙏 
   
   > The FileStatisticsCache will need to be a trait with a list_entries type 
function to get the cached entries, just like FileMetadataCache. Since 
FileMetadataCache is public, it will cause breaking changes, but it shouldn't 
be difficult to migrate.
   
   I agree we'll need a real trait and your proposal makes sense to me
   
   > In your example there is a statistics_size_bytes, but I don't think that 
information is directly available in Statistics, unlike with the cache metadata 
entries. However, looking at the Statistics struct it appears to only store two 
Precision<usize> + five Precision<usize | ScalarValue> per column stored, which 
should be relatively small.
   
   I think this can get potentially quite large for wide tables or with large 
numbers of columns. For example, a [`ScalarValue` is 64 bytes 
each](https://github.com/apache/datafusion/blob/838e1dea832e3cd8585498ba12216e1ad9f584a4/datafusion/expr/src/expr.rs#L3942-L3941),
 so that means a 20 column table with 10 row groups each takes 6KB (without 
accounting for the size of Vecs, etc): 16 + (20 * 5 * 64)
   
   I recommend:
   1. For the first PR, report a memory size of -1 or something to show it 
isn't actually accurate, and limit the size of the cache by number of 
`Statistics` objects
   2. As a follow on PR, we can add a way to calculate the memory usage and 
also limit the memory usage by size
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to