kosiew commented on PR #21566: URL: https://github.com/apache/datafusion/pull/21566#issuecomment-4241211247
@AdamGS Thanks! I agree this is the direction we should take.

This PR keeps the cache Parquet-local on purpose because the reusable setup currently stores Parquet-specific artifacts: the adapted Parquet projection/predicate, the physical schema after Parquet file-schema coercions / INT96 handling, and the row-group PruningPredicate. It also leaves page-index work, reader metadata, file metrics, and access-plan execution per file.

That said, the shape is a good stepping stone toward the more format-independent problem in #20078. The parts that seem general are:

- scan-local reuse keyed by logical schema, physical schema, projection, predicate, and adapter cache-safety;
- avoiding repeated PhysicalExprAdapterFactory::create / rewrite / simplification work for files with equivalent schema inputs;
- letting custom adapters opt in only when their rewrites do not depend on factory-local or unkeyed per-file state.

I would prefer to land this narrowly for Parquet first, with the cache-safety contract and tests in place, then follow up by extracting the format-neutral expression adaptation / pruning setup cache into a datasource-level helper once another FileSource can exercise it. Vortex or another custom FileSource would be a good second consumer to make sure the abstraction is not overfit to Parquet row-group pruning.
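As a rough illustration of the scan-local reuse described above, here is a minimal Rust sketch. It is not the code in this PR and does not use DataFusion types: `SetupKey`, `AdaptedSetup`, `ExprAdapter`, `ScanSetupCache`, and `SimpleAdapter` are hypothetical names, and plain strings stand in for schemas and expressions. It also assumes cache-safety acts as an opt-in gate rather than a key field, which may differ from how the PR models it.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical cache key: the inputs that determine the adapted setup.
/// Strings stand in for real schemas and physical expressions.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct SetupKey {
    pub logical_schema: String,
    pub physical_schema: String,
    pub projection: Vec<usize>,
    pub predicate: Option<String>,
}

/// Hypothetical stand-in for the reusable artifacts
/// (adapted projection/predicate, pruning predicate, ...).
#[derive(Debug, PartialEq, Eq)]
pub struct AdaptedSetup {
    pub adapted_predicate: Option<String>,
}

/// Hypothetical adapter interface with an explicit cache-safety opt-in.
pub trait ExprAdapter {
    /// True only if `adapt` depends on nothing beyond the key.
    fn cache_safe(&self) -> bool;
    fn adapt(&self, key: &SetupKey) -> AdaptedSetup;
}

/// Scan-local cache: one instance per scan, shared across its files.
#[derive(Default)]
pub struct ScanSetupCache {
    entries: HashMap<SetupKey, Arc<AdaptedSetup>>,
    /// Counts how often the (expensive) adaptation actually ran.
    pub adapt_calls: usize,
}

impl ScanSetupCache {
    pub fn get_or_adapt(&mut self, key: SetupKey, adapter: &dyn ExprAdapter) -> Arc<AdaptedSetup> {
        // Adapters that did not opt in always recompute and never populate the cache.
        if !adapter.cache_safe() {
            self.adapt_calls += 1;
            return Arc::new(adapter.adapt(&key));
        }
        // Files with equivalent schema inputs share one adapted setup.
        if let Some(hit) = self.entries.get(&key) {
            return Arc::clone(hit);
        }
        self.adapt_calls += 1;
        let setup = Arc::new(adapter.adapt(&key));
        self.entries.insert(key, Arc::clone(&setup));
        setup
    }
}

/// Toy adapter used only to exercise the cache.
pub struct SimpleAdapter {
    pub safe: bool,
}

impl ExprAdapter for SimpleAdapter {
    fn cache_safe(&self) -> bool {
        self.safe
    }

    fn adapt(&self, key: &SetupKey) -> AdaptedSetup {
        AdaptedSetup {
            adapted_predicate: key
                .predicate
                .as_ref()
                .map(|p| format!("{} [adapted to {}]", p, key.physical_schema)),
        }
    }
}
```

In this sketch, two files that produce the same `SetupKey` trigger a single `adapt` call when the adapter opts in, while an adapter that reports `cache_safe() == false` is re-run for every file and never stores an entry.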
