hudi-bot opened a new issue, #17044:
URL: https://github.com/apache/hudi/issues/17044
h4. *Summary*
Partition pruning in Hudi fails when a query includes *non-deterministic
expressions* (e.g. {{rand()}}) in the filter clause, even when a
deterministic *partition filter* is present. This leads to *full partition
scans* instead of optimized reads, significantly impacting performance.
h4. *Observed Behavior*
In {{HoodiePruneFileSourcePartitions.scala}}, the pruning rule uses:
{code:java}
override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
  case op @ PhysicalOperation(projects, filters,
      lr @ LogicalRelation(HoodieRelationMatcher(fileIndex), _, _, _))
    if !fileIndex.hasPredicatesPushedDown => {code}
This uses {{PhysicalOperation}}, which in turn extends
{{OperationHelper}}. The {{collectProjectsAndFilters}} method inside
{{OperationHelper}} checks:
{code:java}
!legacyMode || condition.deterministic {code}
Since {{legacyMode = true}} for {{PhysicalOperation}}, a condition that is
*not deterministic* fails this check and is not collected. Filter collection
stops there, so *{{filters}} is empty* and partition pruning is skipped.
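The effect of the {{legacyMode}} guard can be sketched outside Spark. The following is a simplified model, not Spark's or Hudi's actual classes (the names {{FilterCollection}}, {{Condition}}, and {{collect}} are illustrative): in legacy mode, the first non-deterministic condition aborts collection entirely, so even the deterministic partition filter beneath it never reaches the pruning rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of how a plan-pattern collector gathers the filter
// conditions sitting above a relation (illustrative names, not Spark's API).
public class FilterCollection {
    // A filter condition with a deterministic flag: "dt = '2024-01-01'" is
    // deterministic, "rand() < 0.1" is not.
    record Condition(String expr, boolean deterministic) {}

    // Mirrors the guard `!legacyMode || condition.deterministic`:
    // in legacy mode, a non-deterministic condition ends the walk, and the
    // relation underneath is never matched, so no filters are collected.
    static List<Condition> collect(List<Condition> stack, boolean legacyMode) {
        List<Condition> collected = new ArrayList<>();
        for (Condition c : stack) {
            if (!legacyMode || c.deterministic()) {
                collected.add(c);
            } else {
                // Legacy mode hit a non-deterministic condition: abort.
                return List.of();
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Condition> filters = Arrays.asList(
            new Condition("rand() < 0.1", false),
            new Condition("dt = '2024-01-01'", true));

        // legacyMode = true (PhysicalOperation-like): nothing is collected.
        System.out.println(collect(filters, true));        // prints []
        // legacyMode = false (ScanOperation-like): both are collected.
        System.out.println(collect(filters, false).size()); // prints 2
    }
}
```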
h4. *Comparison with Spark's Native Pruning*
Spark uses {{PruneFileSourcePartitions}} for native Parquet/Hive tables. Its
code pattern is nearly identical, but it relies on {{ScanOperation}}, which
overrides:
{code:java}
override protected def legacyMode: Boolean = false {code}
This allows Spark to *collect deterministic partition filters even when
mixed with non-deterministic ones*, and to successfully apply partition
pruning in the *physical plan only* (not the logical plan).
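Once the mixed filter list is collected, only part of it can drive pruning. A minimal sketch of that split, again with illustrative names ({{PartitionFilterSplit}}, {{Predicate}}, {{pruningFilters}} are not real Spark or Hudi classes): a predicate is usable for partition pruning only if it is deterministic and references nothing but partition columns; everything else must still be evaluated per row after the scan.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified model of splitting a mixed filter list into partition-pruning
// predicates vs. predicates evaluated after the scan (illustrative names).
public class PartitionFilterSplit {
    record Predicate(String expr, Set<String> referencedCols, boolean deterministic) {}

    // Keep a predicate for pruning only if it is deterministic and touches
    // nothing but partition columns.
    static List<Predicate> pruningFilters(List<Predicate> all, Set<String> partitionCols) {
        return all.stream()
                  .filter(p -> p.deterministic() && partitionCols.containsAll(p.referencedCols()))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Predicate> filters = List.of(
            new Predicate("dt = '2024-01-01'", Set.of("dt"), true),
            new Predicate("rand() < 0.1", Set.of(), false),
            new Predicate("amount > 100", Set.of("amount"), true));

        // Only the dt predicate survives as a partition-pruning filter;
        // rand() is non-deterministic and amount is not a partition column.
        System.out.println(pruningFilters(filters, Set.of("dt")).size()); // prints 1
    }
}
```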
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-9502
- Type: Bug
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]