kosiew commented on code in PR #22950:
URL: https://github.com/apache/datafusion/pull/22950#discussion_r3427585952
##########
datafusion/catalog-listing/src/table.rs:
##########
@@ -187,6 +188,12 @@ pub struct ListingTable {
definition: Option<String>,
/// Cache for collected file statistics
collected_statistics: Option<Arc<dyn FileStatisticsCache>>,
+ /// Cache scoped to this [`ListingTable`].
+ ///
+ /// Anonymous tables do not have a stable table id in the shared cache key
+ /// and may read the same path with different explicit schemas. Use this
+ /// cache for those tables rather than populating the shared session cache.
+ local_statistics_cache: Arc<dyn FileStatisticsCache>,
Review Comment:
I think this fixes the anonymous `ListingTable` statistics reuse issue, but
it does so by giving each anonymous table its own `DefaultFileStatisticsCache`.
That seems to bypass the intended global
`datafusion.runtime.file_statistics_cache_limit`: `with_cache` copies the full
shared cache limit into each instance, so N anonymous tables could together
hold up to N times the configured limit.
The invariant that was failing appears to be narrower: anonymous reads with
`SchemaSource::Specified` should not reuse statistics that were computed for
the same path under a different schema.
Could we avoid caching entirely for anonymous specified-schema tables
instead? Registered tables could continue using the shared cache through their
table reference, and anonymous inferred-schema reads could still share
statistics by path when the schema is inferred consistently.
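
To make the suggestion concrete, here is a rough sketch of the decision I have
in mind. It is not code from this PR: `StatsCachePolicy` and
`stats_cache_policy` are hypothetical names, and the `SchemaSource` enum below
is a simplified stand-in for the real one, which may have other variants.

```rust
/// Simplified stand-in for how a table's schema was obtained (assumption:
/// the real `SchemaSource` may carry additional variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SchemaSource {
    Inferred,
    Specified,
}

/// Hypothetical description of which statistics cache a scan should use.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StatsCachePolicy {
    /// Shared session cache, keyed through the table reference.
    SharedByTable,
    /// Shared session cache, keyed by path only.
    SharedByPath,
    /// Skip the cache and recompute statistics on each read.
    NoCache,
}

/// Hypothetical helper: `table_ref` is `None` for anonymous tables.
fn stats_cache_policy(
    table_ref: Option<&str>,
    schema_source: SchemaSource,
) -> StatsCachePolicy {
    match (table_ref, schema_source) {
        // Registered tables have a stable id, so the shared cache is safe.
        (Some(_), _) => StatsCachePolicy::SharedByTable,
        // Anonymous reads with an inferred schema can share by path, as long
        // as inference is consistent for that path.
        (None, SchemaSource::Inferred) => StatsCachePolicy::SharedByPath,
        // Anonymous reads with an explicit schema must not reuse statistics
        // computed under a different schema, so they do not cache at all.
        (None, SchemaSource::Specified) => StatsCachePolicy::NoCache,
    }
}
```

With something like this, only the shared cache ever holds entries, so the
global limit keeps applying, at the cost of recomputing statistics for
anonymous specified-schema reads.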