kosiew commented on code in PR #22950:
URL: https://github.com/apache/datafusion/pull/22950#discussion_r3427585952


##########
datafusion/catalog-listing/src/table.rs:
##########
@@ -187,6 +188,12 @@ pub struct ListingTable {
     definition: Option<String>,
     /// Cache for collected file statistics
     collected_statistics: Option<Arc<dyn FileStatisticsCache>>,
+    /// Cache scoped to this [`ListingTable`].
+    ///
+    /// Anonymous tables do not have a stable table id in the shared cache key
+    /// and may read the same path with different explicit schemas. Use this
+    /// cache for those tables rather than populating the shared session cache.
+    local_statistics_cache: Arc<dyn FileStatisticsCache>,

Review Comment:
   I think this fixes the anonymous `ListingTable` statistics reuse issue, but 
it does so by giving each anonymous table its own `DefaultFileStatisticsCache`.
   
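   That seems to bypass the intended global 
`datafusion.runtime.file_statistics_cache_limit`: every anonymous table gets a 
separate cache, and `with_cache` copies the full shared cache limit into each 
instance, so total cached statistics can grow with the number of anonymous 
tables instead of staying under a single limit.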
   
   The invariant that was failing appears to be narrower: anonymous reads with 
`SchemaSource::Specified` should not reuse statistics that were computed for 
the same path under a different schema.
   
   Could we avoid caching entirely for anonymous specified-schema tables 
instead? Registered tables could continue using the shared cache through their 
table reference, and anonymous inferred-schema reads could still share 
statistics by path when the schema is inferred consistently.
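
A rough sketch of the policy suggested above, to make the three cases concrete. Every name here (`StatsCachePolicy`, `stats_cache_policy`, and the simplified two-variant `SchemaSource`) is hypothetical and written only for illustration. None of it is taken from the PR or from DataFusion's actual API.

```rust
// Hypothetical sketch: these types are NOT DataFusion APIs.

/// Simplified stand-in for the schema source discussed in the review.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SchemaSource {
    /// The caller supplied an explicit schema.
    Specified,
    /// The schema was inferred from the files.
    Inferred,
}

/// Which statistics cache, if any, a table would use (assumed names).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StatsCachePolicy {
    /// Shared session cache, keyed through the table reference.
    SharedByTable,
    /// Shared session cache, keyed by path only.
    SharedByPath,
    /// Skip caching, so statistics computed under one explicit schema
    /// are never reused under another.
    Skip,
}

/// Pick a policy from whether the table is registered (has a table
/// reference) and how its schema was obtained.
pub fn stats_cache_policy(
    table_ref: Option<&str>,
    schema_source: SchemaSource,
) -> StatsCachePolicy {
    match (table_ref, schema_source) {
        // Registered tables keep using the shared cache.
        (Some(_), _) => StatsCachePolicy::SharedByTable,
        // Anonymous, inferred schema: sharing by path is assumed safe
        // as long as inference is consistent.
        (None, SchemaSource::Inferred) => StatsCachePolicy::SharedByPath,
        // Anonymous, explicit schema: the case that was failing.
        (None, SchemaSource::Specified) => StatsCachePolicy::Skip,
    }
}
```

With this shape no per-table cache is allocated, so the single shared limit would remain the only bound on cached statistics. The cost is that anonymous specified-schema reads recompute statistics each time.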



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

