suryaprasanna opened a new pull request, #17514:
URL: https://github.com/apache/hudi/pull/17514
### Describe the issue this Pull Request addresses
This PR implements a column pruning optimization for incremental queries by
migrating `IncrementalRelationV1` and `IncrementalRelationV2` from the
`TableScan` interface to `PrunedScan`. Currently, incremental queries read all
columns from the source files even when only a subset is needed, which causes
unnecessary I/O and memory overhead.
### Summary and Changelog
Incremental reads become faster for users because only the required columns
are read from source files instead of the full schema.
**Changes:**
- Migrated `IncrementalRelationV1` and `IncrementalRelationV2` from
`TableScan` to `PrunedScan`
- Implemented `buildScan(requiredColumns: Array[String])` to accept the
columns requested by the query
- Added `getPrunedSchema()` to build a schema with the required columns plus
mandatory fields (`_hoodie_commit_time` and partition columns)
- Added `filterRequiredColumnsFromDF()` to remove auxiliary columns from the
final result
- Enabled Parquet filter pushdown and record-level filtering so that filters
are applied while reading
- Updated `HoodieStreamSourceV1` and `HoodieStreamSourceV2` to pass the
required columns
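
The pruning flow in the list above can be illustrated with a minimal, dependency-free sketch. It is not the PR's code: `Field`, `prunedSchema`, and `projectRequired` are hypothetical stand-ins for the roles that `getPrunedSchema()` and `filterRequiredColumnsFromDF()` play, and the sample schema, rows, and begin instant are invented. It assumes the mandatory fields are read so that rows can still be filtered by commit time before the auxiliary columns are dropped.

```scala
// Hypothetical model of the pruning steps; names are illustrative only
// and do not come from the Hudi code base.
object ColumnPruningSketch {
  // A table field reduced to its name; a real schema would also carry types.
  final case class Field(name: String)

  // Step 1: keep the required columns plus the mandatory fields,
  // preserving the order of the full table schema.
  def prunedSchema(
      tableSchema: Seq[Field],
      requiredColumns: Seq[String],
      mandatoryFields: Seq[String]): Seq[Field] = {
    val wanted = (requiredColumns ++ mandatoryFields).toSet
    tableSchema.filter(f => wanted.contains(f.name))
  }

  // Step 2: drop the auxiliary columns again, returning values
  // in the order the query requested them.
  def projectRequired(row: Map[String, Any], requiredColumns: Seq[String]): Seq[Any] =
    requiredColumns.map(row)

  def main(args: Array[String]): Unit = {
    val tableSchema =
      Seq("_hoodie_commit_time", "id", "name", "payload", "region").map(Field(_))
    val required = Seq("name")
    // Assumed mandatory fields: the commit time plus one partition column.
    val mandatory = Seq("_hoodie_commit_time", "region")

    val schema = prunedSchema(tableSchema, required, mandatory)
    println(schema.map(_.name).mkString(", "))

    // Rows as they would be read with the pruned schema (invented data).
    val rows = Seq(
      Map[String, Any]("_hoodie_commit_time" -> "001", "name" -> "a", "region" -> "us"),
      Map[String, Any]("_hoodie_commit_time" -> "003", "name" -> "b", "region" -> "eu")
    )
    val beginInstant = "002"
    // Commit-time filtering needs the mandatory field, even though the
    // query itself only asked for `name`.
    val result = rows
      .filter(r => r("_hoodie_commit_time").toString > beginInstant)
      .map(projectRequired(_, required))
    println(result)
  }
}
```

In this sketch the `id` and `payload` columns are never part of the read schema, which is where the I/O saving for wide tables would come from.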
### Impact
**Performance improvement** - Incremental queries will read fewer columns
from Parquet files, reducing I/O and improving query latency, especially for
wide tables when a query selects only a few columns.
### Risk Level
**Low** - This is an optimization that maintains backward compatibility.
The changes affect only the incremental query path and fall back to reading
all necessary fields when required.
### Documentation Update
None - This is an internal optimization with no user-facing configuration
changes or API modifications.
### Contributor's checklist
- [x] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable