[PR] feat(spark): implement column pruning for incremental queries [hudi]

via GitHub Sun, 07 Dec 2025 14:54:01 -0800


suryaprasanna opened a new pull request, #17514:
URL: https://github.com/apache/hudi/pull/17514


   ### Describe the issue this Pull Request addresses
   
   This PR implements column pruning for incremental queries by 
migrating `IncrementalRelationV1` and `IncrementalRelationV2` from the 
`TableScan` interface to `PrunedScan`. Currently, incremental queries read all 
columns from the source files even when only a subset is needed, leading to 
unnecessary I/O and memory overhead.
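
To illustrate the difference between the two interfaces: Spark's `TableScan` exposes `buildScan()` with no arguments, while `PrunedScan` exposes `buildScan(requiredColumns: Array[String])`, so the relation learns which columns the query needs. The sketch below is not the PR's code. It uses stand-in traits (`TableScanLike`, `PrunedScanLike`) and a hypothetical in-memory table so that it runs without Spark.

```scala
object ScanSketch {
  type Row = Seq[Any]

  // Stand-ins for org.apache.spark.sql.sources.TableScan / PrunedScan,
  // modeled on plain collections instead of RDD[Row] (assumption for brevity).
  trait TableScanLike { def buildScan(): Seq[Row] }
  trait PrunedScanLike { def buildScan(requiredColumns: Array[String]): Seq[Row] }

  // Hypothetical table: column names and two rows in that column order.
  val columns: Vector[String] = Vector("_hoodie_commit_time", "id", "name", "payload")
  val data: Seq[Row] = Seq(
    Seq("001", 1, "a", "xxxxxxxx"),
    Seq("002", 2, "b", "yyyyyyyy")
  )

  // TableScan style: every column is returned, whatever the query selected.
  object FullScan extends TableScanLike {
    def buildScan(): Seq[Row] = data
  }

  // PrunedScan style: only the requested columns are returned, in the
  // requested order. Assumes every requested name exists in `columns`.
  object ColumnPrunedScan extends PrunedScanLike {
    def buildScan(requiredColumns: Array[String]): Seq[Row] = {
      val idx = requiredColumns.toSeq.map(c => columns.indexOf(c))
      data.map(row => idx.map(i => row(i)))
    }
  }
}
```

With this sketch, `ScanSketch.ColumnPrunedScan.buildScan(Array("name", "id"))` yields two-column rows, whereas `ScanSketch.FullScan.buildScan()` always yields all four columns.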
   
   ### Summary and Changelog
   
   Incremental reads become faster because only the required columns are read 
from source files instead of the full schema.
   
    **Changes:**
     - Migrated `IncrementalRelationV1` and `IncrementalRelationV2` from 
`TableScan` to `PrunedScan`
     - Implemented `buildScan(requiredColumns: Array[String])` so that Spark 
can pass in the required columns
     - Added `getPrunedSchema()` to build a schema with the required columns 
plus mandatory fields (`_hoodie_commit_time` and the partition columns)
     - Added `filterRequiredColumnsFromDF()` to remove auxiliary columns from 
the final result
     - Enabled Parquet filter pushdown and record-level filters so that less 
data is read
     - Updated `HoodieStreamSourceV1` and `HoodieStreamSourceV2` to pass 
required columns
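
The two helper steps in the list above can be sketched roughly as follows. This is an illustration only, not the PR's implementation: `prunedColumns` and `dropAuxiliary` are hypothetical names standing in for `getPrunedSchema()` and `filterRequiredColumnsFromDF()`, and plain Scala collections stand in for Spark's `StructType` and `DataFrame`. The mandatory fields are presumably needed internally (for example, to filter on commit time) even when the query does not select them.

```scala
object PruneSketch {
  val CommitTimeField = "_hoodie_commit_time"

  // Columns to read: the required ones plus the mandatory fields
  // (commit time and partition columns), kept in table-schema order.
  def prunedColumns(tableSchema: Seq[String],
                    requiredColumns: Seq[String],
                    partitionColumns: Seq[String]): Seq[String] = {
    val needed = requiredColumns.toSet ++ partitionColumns + CommitTimeField
    tableSchema.filter(needed.contains)
  }

  // Project rows that were read with the pruned schema down to the columns
  // the query asked for, in the requested order, dropping auxiliary ones.
  // Assumes every required column is present in `readColumns`.
  def dropAuxiliary(readColumns: Seq[String],
                    requiredColumns: Seq[String],
                    rows: Seq[Seq[Any]]): Seq[Seq[Any]] = {
    val idx = requiredColumns.map(c => readColumns.indexOf(c))
    rows.map(row => idx.map(i => row(i)))
  }
}
```

For a hypothetical table with columns `_hoodie_commit_time, id, name, payload, dt` partitioned by `dt`, a query selecting only `name` would read `_hoodie_commit_time`, `name`, and `dt`, then return just `name`.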
   
   ### Impact
   
    **Performance improvement** - Incremental queries will read fewer columns 
from Parquet files, reducing I/O and improving query latency, especially when 
a query selects only a few columns from a wide table.
   
   ### Risk Level
   
    **Low** - This is an optimization that maintains backward compatibility. 
The changes affect only the incremental query path, which falls back to 
reading all necessary fields when required.
   
   ### Documentation Update
   
    None - This is an internal optimization with no user-facing configuration 
changes or API modifications.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   

