kosiew opened a new pull request, #21566:
URL: https://github.com/apache/datafusion/pull/21566

   
   ## Which issue does this PR close?
   
   * Part of #21554
   
   ---
   
   ## Rationale for this change
   
   Parquet scans currently adapt and simplify projection and predicate 
expressions per file, even when multiple files share the same physical schema 
and query inputs. This results in repeated CPU work (expression rewriting, 
simplification, and pruning predicate construction) that is identical across 
files.
   
   This PR introduces a scan-local cache to reuse this CPU-only setup when 
safe, reducing redundant computation and improving performance for datasets 
with many files but few schema variations.
   
   ---
   
   ## What changes are included in this PR?
   
   * Introduced `ParquetPruningSetupCache` owned by `ParquetMorselizer`
   
     * Stores adapted projection, predicate, and row-group pruning predicate
     * Uses a `Mutex` and a `Condvar` to coordinate concurrent access and avoid 
duplicate work
   
   * Added `ParquetPruningSetupCacheKey`
   
     * Includes:
   
       * Logical file schema
       * Physical file schema
       * Predicate identity (pointer-based)
       * Projection expression identities
     * Ensures reuse only when inputs are equivalent within a scan
   
   * Added `ParquetPruningSetup` and cache entry state machine
   
     * States: `Pending`, `Ready`, `Failed`
     * Prevents duplicate computation and propagates errors safely
   
   * Refactored pruning setup logic
   
     * Extracted into `build_pruning_setup`
     * Added `build_or_get_pruning_setup` to handle cache lookup/fallback
     * Preserves existing behavior on cache miss
   
   * Integrated cache usage into `MetadataLoadedParquetOpen`
   
     * Replaces the per-file rewrite and pruning predicate construction with the 
cached setup when applicable
   
   * Added `supports_reusable_rewrites` to `PhysicalExprAdapterFactory`
   
     * Defaults to `false`
     * Enabled for `DefaultPhysicalExprAdapterFactory`
     * Ensures only cache-safe adapters participate in reuse
   
   * Added `pruning_setup_reusable` guard
   
     * Disables reuse when literal column replacement occurs
   
   * Wired cache through `ParquetMorselizer` and `ParquetSource`
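
To illustrate the `Pending` / `Ready` / `Failed` coordination described above, here is a minimal sketch of a build-once cache guarded by a `Mutex` and a `Condvar`. It is not the code in this PR: `SetupCache`, `Entry`, `State`, and `get_or_build` are hypothetical names, errors are simplified to `String`, and the real cache stores the adapted projection, predicate, and pruning predicate rather than a generic `V`.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, Condvar, Mutex};

/// State of one cache slot (hypothetical mirror of the PR's state machine).
enum State<V> {
    Pending,
    Ready(Arc<V>),
    Failed(String),
}

/// One slot: waiters block on `ready` until the builder leaves `Pending`.
struct Entry<V> {
    state: Mutex<State<V>>,
    ready: Condvar,
}

/// Hypothetical scan-local cache; not the PR's `ParquetPruningSetupCache`.
pub struct SetupCache<K, V> {
    entries: Mutex<HashMap<K, Arc<Entry<V>>>>,
}

impl<K: Eq + Hash + Clone, V> SetupCache<K, V> {
    pub fn new() -> Self {
        Self { entries: Mutex::new(HashMap::new()) }
    }

    /// Runs `build` at most once per key while an entry exists.
    /// Assumption: `build` does not panic; a panic would leave waiters blocked.
    pub fn get_or_build<F>(&self, key: K, build: F) -> Result<Arc<V>, String>
    where
        F: FnOnce() -> Result<V, String>,
    {
        // Claim the slot (becoming the builder) or join an existing one.
        let (entry, is_builder) = {
            let mut map = self.entries.lock().unwrap();
            match map.get(&key) {
                Some(existing) => (Arc::clone(existing), false),
                None => {
                    let fresh = Arc::new(Entry {
                        state: Mutex::new(State::Pending),
                        ready: Condvar::new(),
                    });
                    map.insert(key.clone(), Arc::clone(&fresh));
                    (fresh, true)
                }
            }
        };

        if is_builder {
            // Build without holding the map lock so other keys are not blocked.
            let result = build().map(Arc::new);
            if result.is_err() {
                // Drop the failed slot so a later file can retry the build.
                self.entries.lock().unwrap().remove(&key);
            }
            let mut state = entry.state.lock().unwrap();
            *state = match &result {
                Ok(value) => State::Ready(Arc::clone(value)),
                Err(message) => State::Failed(message.clone()),
            };
            entry.ready.notify_all();
            return result;
        }

        // Waiter: sleep until the builder publishes a result or an error.
        let mut state = entry.state.lock().unwrap();
        loop {
            match &*state {
                State::Ready(value) => return Ok(Arc::clone(value)),
                State::Failed(message) => return Err(message.clone()),
                State::Pending => {}
            }
            state = entry.ready.wait(state).unwrap();
        }
    }
}
```

In this sketch a failed build is reported to any threads already waiting on the slot, and the slot is removed from the map so that the next caller starts a fresh build instead of observing the stale error.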
   
   ---
   
   ## Are these changes tested?
   
   Yes. New tests validate correctness and cache boundaries:
   
   * Reuse occurs for files with the same physical schema and reusable adapter
   * No reuse when adapter does not support reusable rewrites
   * No reuse across different physical schemas
   
   Additionally:
   
   * A custom counting adapter factory verifies that rewrite creation is 
invoked only once in cache-hit scenarios
   * Existing pruning behavior and correctness are preserved
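
The counting-factory idea can be sketched roughly as follows. This is a simplified illustration, not the test code in this PR: `AdapterFactory`, `create_rewrite`, `CountingFactory`, and `rewrite_for_files` are hypothetical stand-ins, and the real `PhysicalExprAdapterFactory` has a different signature. Only the opt-in flag `supports_reusable_rewrites` and its `false` default are taken from the description above.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical, simplified stand-in for an adapter factory.
pub trait AdapterFactory {
    /// Produces a rewritten expression; modelled here as a plain `String`.
    fn create_rewrite(&self, expr: &str) -> String;

    /// Opt-in flag: reuse is disabled unless a factory overrides this.
    fn supports_reusable_rewrites(&self) -> bool {
        false
    }
}

/// Test double that counts how often a rewrite is created.
pub struct CountingFactory {
    reusable: bool,
    calls: AtomicUsize,
}

impl CountingFactory {
    pub fn new(reusable: bool) -> Self {
        Self { reusable, calls: AtomicUsize::new(0) }
    }

    pub fn calls(&self) -> usize {
        self.calls.load(Ordering::SeqCst)
    }
}

impl AdapterFactory for CountingFactory {
    fn create_rewrite(&self, expr: &str) -> String {
        self.calls.fetch_add(1, Ordering::SeqCst);
        format!("adapted({expr})")
    }

    fn supports_reusable_rewrites(&self) -> bool {
        self.reusable
    }
}

/// Rewrites `expr` for `n_files` files that share one physical schema,
/// reusing the first rewrite only when the factory opts in.
pub fn rewrite_for_files(
    factory: &dyn AdapterFactory,
    expr: &str,
    n_files: usize,
) -> Vec<String> {
    let mut cached: Option<String> = None;
    (0..n_files)
        .map(|_| {
            if factory.supports_reusable_rewrites() {
                cached
                    .get_or_insert_with(|| factory.create_rewrite(expr))
                    .clone()
            } else {
                factory.create_rewrite(expr)
            }
        })
        .collect()
}
```

With a reusable factory and three files, the counter ends at one; with a non-reusable factory it ends at three, which is the boundary the tests listed above are meant to check.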
   
   ---
   
   ## Are there any user-facing changes?
   
   No user-facing API changes.
   
   This is an internal performance optimization. Query results and behavior 
remain unchanged.
   
   ---
   
   ## LLM-generated code disclosure
   
   This PR includes LLM-generated code and comments. All LLM-generated content 
has been manually reviewed and tested.
   
   ---
   
   ## Notes
   
   * Cache is scoped to a single scan via `ParquetMorselizer`
   * Failed entries are removed so that an error does not poison the cache
   * Page-level pruning is intentionally excluded and handled separately 
(future work)
   
   ---
   
   ## Performance Impact
   
   Expected improvements for workloads with:
   
   * Many files
   * Few distinct physical schemas
   * Non-trivial predicates/projections
   
   The cache avoids repeating expression rewriting and pruning predicate 
construction for every file.
   
   ---
   
   ## Future Work
   
   * Extend caching to page-level pruning setup
   * Add metrics for cache hit/miss visibility
   * Explore more robust expression fingerprinting beyond pointer identity
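
For context on the last item, here is a minimal sketch of what a pointer-identity cache key can look like and where it falls short. All names (`Expr`, `ExprIdentity`, `SetupKey`) are hypothetical, schemas are reduced to lists of column names, and the actual `ParquetPruningSetupCacheKey` may be built differently.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a physical expression.
#[derive(Debug, PartialEq)]
pub struct Expr(pub String);

/// Identity of an expression: the address of its `Arc` allocation.
/// Assumption: the expression outlives the cache (true within one scan),
/// so the address cannot be reused by another allocation.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct ExprIdentity(usize);

impl ExprIdentity {
    pub fn of(expr: &Arc<Expr>) -> Self {
        Self(Arc::as_ptr(expr) as usize)
    }
}

/// Hypothetical key combining both schemas with expression identities.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SetupKey {
    logical_schema: Vec<String>,
    physical_schema: Vec<String>,
    predicate: Option<ExprIdentity>,
    projection: Vec<ExprIdentity>,
}

impl SetupKey {
    pub fn new(
        logical_schema: &[&str],
        physical_schema: &[&str],
        predicate: Option<&Arc<Expr>>,
        projection: &[Arc<Expr>],
    ) -> Self {
        Self {
            logical_schema: logical_schema.iter().map(|s| s.to_string()).collect(),
            physical_schema: physical_schema.iter().map(|s| s.to_string()).collect(),
            predicate: predicate.map(ExprIdentity::of),
            projection: projection.iter().map(ExprIdentity::of).collect(),
        }
    }
}
```

Two clones of the same `Arc` produce equal keys, but two separately allocated expressions with identical contents do not, so the cache misses even though the setup could be shared. Structural fingerprinting would close that gap at the cost of hashing the expression tree.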
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

