hudi-bot opened a new issue, #17044:
URL: https://github.com/apache/hudi/issues/17044
h4. *Summary*
Partition pruning in Hudi fails when a query includes *non-deterministic
expressions* (e.g. {{rand()}}) in the filter clause, even when a
deterministic *partition filter* is present. This leads to *full partition
scans* instead of optimized reads, significantly impacting performance.
h4. *Observed Behavior*
In {{HoodiePruneFileSourcePartitions.scala}}, the pruning rule uses:
{code:java}
override def apply(plan: LogicalPlan): LogicalPlan = plan transformDown {
  case op @ PhysicalOperation(projects, filters,
      lr @ LogicalRelation(HoodieRelationMatcher(fileIndex), _, _, _))
    if !fileIndex.hasPredicatesPushedDown => {code}
This uses {{PhysicalOperation}}, which in turn extends
{{OperationHelper}}. The {{collectProjectsAndFilters}} method inside
{{OperationHelper}} checks:
{code:java}
!legacyMode || condition.deterministic {code}
Since {{legacyMode = true}} for {{PhysicalOperation}}, a condition that is
*not deterministic* fails this check and is not collected. Filter collection
stops there, so *{{filters}} is empty* and partition pruning is skipped.
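The effect of the {{legacyMode}} guard can be sketched outside Spark. The following is a simplified model, not Spark's or Hudi's actual classes (the names {{FilterCollection}}, {{Condition}}, and {{collect}} are illustrative): in legacy mode, the first non-deterministic condition aborts collection entirely, so even the deterministic partition filter beneath it never reaches the pruning rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of how a plan-pattern collector gathers the filter
// conditions sitting above a relation (illustrative names, not Spark's API).
public class FilterCollection {
    // A filter condition with a deterministic flag: "dt = '2024-01-01'" is
    // deterministic, "rand() < 0.1" is not.
    record Condition(String expr, boolean deterministic) {}

    // Mirrors the guard `!legacyMode || condition.deterministic`:
    // in legacy mode, a non-deterministic condition ends the walk, and the
    // relation underneath is never matched, so no filters are collected.
    static List<Condition> collect(List<Condition> stack, boolean legacyMode) {
        List<Condition> collected = new ArrayList<>();
        for (Condition c : stack) {
            if (!legacyMode || c.deterministic()) {
                collected.add(c);
            } else {
                // Legacy mode hit a non-deterministic condition: abort.
                return List.of();
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Condition> filters = Arrays.asList(
            new Condition("rand() < 0.1", false),
            new Condition("dt = '2024-01-01'", true));

        // legacyMode = true (PhysicalOperation-like): nothing is collected.
        System.out.println(collect(filters, true));        // prints []
        // legacyMode = false (ScanOperation-like): both are collected.
        System.out.println(collect(filters, false).size()); // prints 2
    }
}
```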
h4. *Comparison with Spark's Native Pruning*
Spark uses {{PruneFileSourcePartitions}} for native Parquet/Hive tables. Its
code pattern is nearly identical, but it relies on {{ScanOperation}}, which
overrides:
{code:java}
override protected def legacyMode: Boolean = false {code}
This allows Spark to *collect deterministic partition filters even when
mixed with non-deterministic ones*, and to successfully apply partition
pruning in the *physical plan only* (not the logical plan).
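Once the mixed filter list is collected, only part of it can drive pruning. A minimal sketch of that split, again with illustrative names ({{PartitionFilterSplit}}, {{Predicate}}, {{pruningFilters}} are not real Spark or Hudi classes): a predicate is usable for partition pruning only if it is deterministic and references nothing but partition columns; everything else must still be evaluated per row after the scan.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified model of splitting a mixed filter list into partition-pruning
// predicates vs. predicates evaluated after the scan (illustrative names).
public class PartitionFilterSplit {
    record Predicate(String expr, Set<String> referencedCols, boolean deterministic) {}

    // Keep a predicate for pruning only if it is deterministic and touches
    // nothing but partition columns.
    static List<Predicate> pruningFilters(List<Predicate> all, Set<String> partitionCols) {
        return all.stream()
                  .filter(p -> p.deterministic() && partitionCols.containsAll(p.referencedCols()))
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Predicate> filters = List.of(
            new Predicate("dt = '2024-01-01'", Set.of("dt"), true),
            new Predicate("rand() < 0.1", Set.of(), false),
            new Predicate("amount > 100", Set.of("amount"), true));

        // Only the dt predicate survives as a partition-pruning filter;
        // rand() is non-deterministic and amount is not a partition column.
        System.out.println(pruningFilters(filters, Set.of("dt")).size()); // prints 1
    }
}
```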
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-9502
- Type: Bug
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]