kosiew opened a new pull request, #21566:
URL: https://github.com/apache/datafusion/pull/21566
## Which issue does this PR close?
* Part of #21554
---
## Rationale for this change
Parquet scans currently adapt and simplify projection and predicate
expressions per file, even when multiple files share the same physical schema
and query inputs. This results in repeated CPU work (expression rewriting,
simplification, and pruning predicate construction) that is identical across
files.
This PR introduces a scan-local cache to reuse this CPU-only setup when
safe, reducing redundant computation and improving performance for datasets
with many files but few schema variations.
---
## What changes are included in this PR?
* Introduced `ParquetPruningSetupCache` owned by `ParquetMorselizer`
  * Stores adapted projection, predicate, and row-group pruning predicate
  * Uses a `Mutex + Condvar` to coordinate concurrent access and avoid
    duplicate work
* Added `ParquetPruningSetupCacheKey`
  * Includes:
    * Logical file schema
    * Physical file schema
    * Predicate identity (pointer-based)
    * Projection expression identities
  * Ensures reuse only when inputs are equivalent within a scan
* Added `ParquetPruningSetup` and cache entry state machine
  * States: `Pending`, `Ready`, `Failed`
  * Prevents duplicate computation and propagates errors safely
* Refactored pruning setup logic
  * Extracted into `build_pruning_setup`
  * Added `build_or_get_pruning_setup` to handle cache lookup/fallback
  * Preserves existing behavior on cache miss
* Integrated cache usage into `MetadataLoadedParquetOpen`
  * Replaces per-file rewrite + pruning predicate construction with cached
    version when applicable
* Added `supports_reusable_rewrites` to `PhysicalExprAdapterFactory`
  * Defaults to `false`
  * Enabled for `DefaultPhysicalExprAdapterFactory`
  * Ensures only cache-safe adapters participate in reuse
* Added `pruning_setup_reusable` guard
  * Disables reuse when literal column replacement occurs
* Wired cache through `ParquetMorselizer` and `ParquetSource`
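
For reviewers unfamiliar with the pattern, the coordination described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: `SetupCache`, `Entry`, and `get_or_build` are hypothetical names, the key and value types are generic placeholders, and the sketch assumes a failed build is handled by removing the entry instead of storing a `Failed` state.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Condvar, Mutex};

/// State of one cache slot (hypothetical). A failed build removes the slot,
/// so only `Pending` and `Ready` are ever stored in this sketch.
enum Entry<V> {
    Pending,
    Ready(Arc<V>),
}

/// Hypothetical stand-in for a scan-local setup cache.
pub struct SetupCache<K, V> {
    entries: Mutex<HashMap<K, Entry<V>>>,
    changed: Condvar,
}

impl<K: Eq + Hash + Clone, V> SetupCache<K, V> {
    pub fn new() -> Self {
        Self {
            entries: Mutex::new(HashMap::new()),
            changed: Condvar::new(),
        }
    }

    /// Returns the cached value for `key`, building it at most once at a time.
    /// Callers that arrive while a build is in flight wait on the condvar.
    pub fn get_or_build<E>(
        &self,
        key: &K,
        build: impl FnOnce() -> Result<V, E>,
    ) -> Result<Arc<V>, E> {
        let mut entries = self.entries.lock().unwrap();
        loop {
            let pending = match entries.get(key) {
                Some(Entry::Ready(v)) => return Ok(Arc::clone(v)),
                Some(Entry::Pending) => true,
                None => false,
            };
            if !pending {
                break;
            }
            // Another caller is building this entry; sleep until it changes.
            entries = self.changed.wait(entries).unwrap();
        }
        // Claim the slot, then release the lock while doing the CPU work.
        entries.insert(key.clone(), Entry::Pending);
        drop(entries);

        let result = build();

        let mut entries = self.entries.lock().unwrap();
        let out = match result {
            Ok(v) => {
                let v = Arc::new(v);
                entries.insert(key.clone(), Entry::Ready(Arc::clone(&v)));
                Ok(v)
            }
            Err(e) => {
                // Remove the slot so a later caller can retry the build.
                entries.remove(key);
                Err(e)
            }
        };
        drop(entries);
        self.changed.notify_all();
        out
    }
}
```

In this sketch a waiter that wakes after a failure finds no entry and becomes the next builder. It also does not handle a panic inside `build`, which would leave the slot `Pending`; a real implementation would need a guard for that case.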
---
## Are these changes tested?
Yes. New tests validate correctness and cache boundaries:
* Reuse occurs for files with the same physical schema and reusable adapter
* No reuse when adapter does not support reusable rewrites
* No reuse across different physical schemas
Additionally:
* A custom counting adapter factory verifies that rewrite creation is
invoked only once in cache-hit scenarios
* Existing pruning behavior and correctness are preserved
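
The counting approach used in the tests can be illustrated with a simplified sketch. Everything here is hypothetical and deliberately smaller than the real API: `AdapterFactory` is a stand-in trait (not `PhysicalExprAdapterFactory`), a `String` stands in for a physical expression, and `rewrite_for_files` assumes every file shares one physical schema.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical, simplified stand-in for an adapter factory trait.
pub trait AdapterFactory {
    /// Produces a rewritten expression (a `String` stands in for a real one).
    fn create_rewrite(&self, expr: &str) -> String;

    /// Opt-in flag: rewrites are not reused unless the factory says so.
    fn supports_reusable_rewrites(&self) -> bool {
        false
    }
}

/// Test double that counts how often a rewrite is created.
pub struct CountingFactory {
    pub calls: Arc<AtomicUsize>,
    pub reusable: bool,
}

impl AdapterFactory for CountingFactory {
    fn create_rewrite(&self, expr: &str) -> String {
        self.calls.fetch_add(1, Ordering::SeqCst);
        format!("adapted({expr})")
    }

    fn supports_reusable_rewrites(&self) -> bool {
        self.reusable
    }
}

/// Rewrites `expr` for `n_files` files that are assumed to share one schema:
/// once overall when the factory is reusable, otherwise once per file.
pub fn rewrite_for_files(
    factory: &dyn AdapterFactory,
    expr: &str,
    n_files: usize,
) -> Vec<String> {
    let mut cached: Option<String> = None;
    (0..n_files)
        .map(|_| {
            if factory.supports_reusable_rewrites() {
                cached
                    .get_or_insert_with(|| factory.create_rewrite(expr))
                    .clone()
            } else {
                factory.create_rewrite(expr)
            }
        })
        .collect()
}
```

A test can then assert on the shared counter: one call for a reusable factory across several files, and one call per file when the factory keeps the default of `false`.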
---
## Are there any user-facing changes?
No user-facing API changes.
This is an internal performance optimization. Query results and behavior
remain unchanged.
---
## LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content
has been manually reviewed and tested.
---
## Notes
* Cache is scoped to a single scan via `ParquetMorselizer`
* Failure entries are removed to avoid poisoning the cache
* Page-level pruning is intentionally excluded and handled separately
(future work)
---
## Performance Impact
Expected improvements for workloads with:
* Many files
* Few distinct physical schemas
* Non-trivial predicates/projections
Reduces repeated expression rewriting and pruning predicate construction
overhead.
---
## Future Work
* Extend caching to page-level pruning setup
* Add metrics for cache hit/miss visibility
* Explore more robust expression fingerprinting beyond pointer identity
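
To make the last item concrete, the sketch below contrasts a pointer-based key with a structural one. It is only an illustration under stated assumptions: `Expr` is a toy expression type, and `pointer_key` and `structural_key` are hypothetical helpers, not functions from DataFusion.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

/// Toy expression tree used only for this illustration.
#[derive(Debug, PartialEq, Eq, Hash)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Gt(Arc<Expr>, Arc<Expr>),
}

/// Pointer identity: two handles match only if they share one allocation.
/// The key is meaningful only while the `Arc` is kept alive.
pub fn pointer_key(expr: &Arc<Expr>) -> usize {
    Arc::as_ptr(expr) as usize
}

/// Structural fingerprint: hashes the expression's contents, so separately
/// built but equal trees produce the same key.
pub fn structural_key(expr: &Arc<Expr>) -> u64 {
    let mut hasher = DefaultHasher::new();
    expr.hash(&mut hasher);
    hasher.finish()
}

/// Builds `col > value` from scratch (a fresh allocation on every call).
pub fn gt(col: &str, value: i64) -> Arc<Expr> {
    Arc::new(Expr::Gt(
        Arc::new(Expr::Column(col.to_string())),
        Arc::new(Expr::Literal(value)),
    ))
}
```

Pointer identity is cheap and cannot produce a false match while the expressions are alive, but it misses reuse between equal expressions that were built separately. A structural key catches those cases at the cost of walking the tree, and a hash match would still need an equality check to rule out collisions.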