kosiew commented on PR #21566:
URL: https://github.com/apache/datafusion/pull/21566#issuecomment-4241211247

   @AdamGS 
   
   Thanks! I agree this is the direction we should take. This PR keeps the 
cache Parquet-local on purpose because the reusable setup currently stores 
Parquet-specific artifacts: the adapted Parquet projection/predicate, the 
physical schema after Parquet file-schema coercions / INT96 handling, and the 
row-group PruningPredicate. Page-index work, reader metadata, file metrics, 
and access-plan execution still happen per file.
   
   That said, the shape is a good stepping stone toward the more 
format-independent problem in #20078. The parts that seem general are:
   
   - scan-local reuse keyed by logical schema, physical schema, projection, 
     predicate, and adapter cache-safety;
   - avoiding repeated PhysicalExprAdapterFactory::create / rewrite / 
     simplification work for files with equivalent schema inputs;
   - letting custom adapters opt in only when their rewrites do not depend on 
     factory-local or unkeyed per-file state.
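
   As a rough illustration of the general parts listed above (this is not code 
from the PR), here is a minimal std-only sketch. Every name in it (`SetupKey`, 
`AdaptedSetup`, `Adapter`, `cache_safe`, `ScanLocalCache`) is hypothetical, and 
schemas and expressions are modeled as plain strings rather than real 
DataFusion types. It also assumes cache-safety acts as an opt-in gate rather 
than as part of the key.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical key: the inputs that determine the adapted setup.
/// Strings stand in for real schema and expression types.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct SetupKey {
    logical_schema: String,
    physical_schema: String,
    projection: Vec<usize>,
    predicate: Option<String>,
}

/// Hypothetical stand-in for the reusable artifacts
/// (adapted projection/predicate, pruning setup, and so on).
#[derive(Debug)]
struct AdaptedSetup {
    adapted_predicate: Option<String>,
}

/// Hypothetical opt-in contract: an adapter reports whether its rewrites
/// depend only on the keyed inputs and are therefore safe to reuse.
trait Adapter {
    fn cache_safe(&self) -> bool;
    fn adapt(&self, key: &SetupKey) -> AdaptedSetup;
}

/// Toy adapter used only for illustration.
struct SimpleAdapter {
    safe: bool,
}

impl Adapter for SimpleAdapter {
    fn cache_safe(&self) -> bool {
        self.safe
    }

    fn adapt(&self, key: &SetupKey) -> AdaptedSetup {
        AdaptedSetup {
            adapted_predicate: key.predicate.as_ref().map(|p| format!("adapted({})", p)),
        }
    }
}

/// Cache that lives for a single scan and is shared across its files.
struct ScanLocalCache {
    entries: HashMap<SetupKey, Arc<AdaptedSetup>>,
    /// Number of times `adapt` actually ran (useful for tests).
    builds: usize,
}

impl ScanLocalCache {
    fn new() -> Self {
        Self { entries: HashMap::new(), builds: 0 }
    }

    fn get_or_build(&mut self, key: SetupKey, adapter: &dyn Adapter) -> Arc<AdaptedSetup> {
        // Adapters that did not opt in always rebuild and never populate the cache.
        if !adapter.cache_safe() {
            self.builds += 1;
            return Arc::new(adapter.adapt(&key));
        }
        // Files with equivalent inputs share one adapted setup.
        if let Some(hit) = self.entries.get(&key) {
            return Arc::clone(hit);
        }
        self.builds += 1;
        let built = Arc::new(adapter.adapt(&key));
        self.entries.insert(key, Arc::clone(&built));
        built
    }
}
```

   In this sketch, two files that produce the same `SetupKey` with a cache-safe 
adapter share one `Arc<AdaptedSetup>`, while an adapter that does not opt in 
pays the adaptation cost on every call.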
   
   I would prefer to land this narrowly for Parquet first, with the 
cache-safety contract and tests in place, then follow up by extracting the 
format-neutral expression adaptation / pruning setup cache into a 
datasource-level helper once another FileSource can exercise it. Vortex or 
another custom FileSource would be a good second consumer to make sure the 
abstraction is not overfit to Parquet row-group pruning.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

