kosiew commented on code in PR #23201:
URL: https://github.com/apache/datafusion/pull/23201#discussion_r3497939086


##########
datafusion/execution/src/cache/mod.rs:
##########
@@ -165,3 +168,145 @@ impl Display for TableScopedPath {
         }
     }
 }
+
+/// A fingerprint of the `file_schema` used to compute a file's statistics.
+///
+/// Captures exactly the attributes that determine the layout and meaning of
+/// `Statistics::column_statistics`: each column's name, data type and
+/// nullability, in order. It deliberately excludes field/schema metadata, 
which
+/// cannot affect statistics — including it would needlessly fragment the 
cache.
+#[derive(Clone, Debug)]
+pub struct SchemaFingerprint {

Review Comment:
   @mkleen,
   > have a look at the benchmarks results clickbench_partitioned and 
wide_schema you will find that without this optimization there are real 
regressions.
   
   Sorry I missed this.
   
   @Phoenix500526,
   
   I wonder whether  it would be viable to keep `TableScopedPath` as the cache 
key and store the schema fingerprint in the cached value too? The lookup would 
be:
   
   1. look up by the existing `{table, path}` key;
   2. validate file metadata as today (`size`, `last_modified`);
   3. also validate `cached.schema_fingerprint == current_schema_fingerprint`;
   4. if either validation fails, treat it as a miss and overwrite the entry 
for that `{table, path}`.
   
   This still fixes the bug: stats computed for one schema cannot be reused for 
another schema. It also preserves cache reuse for repeated anonymous 
explicit-schema reads with the same schema, which is the improvement over 
#22950.
   
   The tradeoff is that we would only keep one schema's stats per `{table, 
path}`, so alternating schemas for the same file may recompute. That seems 
acceptable to me because the previous fallback was to skip caching for this 
problematic case entirely, and a full multi-schema cache feels more complex 
than this issue needs.
   
   Benchmark-wise, this may improve Q6 if the regression is primarily from 
schema-aware cache keys, because the key hash/probe returns to the old cheap 
`{table, path}` shape. We would still need to verify with 
`clickbench_partitioned`, since value-side validation still has per-hit cost: 
the current schema fingerprint must exist and cache hits still need schema 
validation/fingerprint comparison. For wide schemas, the precomputed 
fingerprint hash can still keep that validation cheap. Memory/eviction should 
also improve because there is at most one entry per `{table, path}` rather than 
one per `{table, path, schema}`.
   
   If we take this approach, `FileStatisticsCache` can likely remain keyed by 
`TableScopedPath`, so the 55.0.0 upgrade-guide entry for changing the public 
cache key type can be removed or reduced to any smaller `CachedFileMetadata` 
API change.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to