suryaprasanna opened a new pull request, #18136:
URL: https://github.com/apache/hudi/pull/18136
### Describe the issue this Pull Request addresses
This PR addresses OOM (Out of Memory) issues on the Spark driver when
querying large Hudi datasets that have multiple versions of files in the same
partition. When file listing is performed without filtering, all file versions
are loaded into memory, which can cause the driver to run out of memory.
### Summary and Changelog
Users can now enable path filtering during file listing to avoid loading
multiple file versions into memory on the driver. This is controlled by the new
config
`hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter`.
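As an illustration of how a reader might opt in, here is a minimal sketch that builds the option map. It assumes the new key is accepted as a datasource read option like other `hoodie.datasource.read.*` keys. The helper name `hudi_read_options` and the table path are hypothetical and not part of this PR.

```python
# Key name taken from the PR description.
RO_PATH_FILTER_KEY = (
    "hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter"
)


def hudi_read_options(use_ro_path_filter: bool = False) -> dict:
    """Build reader options (hypothetical helper); the flag defaults to false, as in the PR."""
    return {RO_PATH_FILTER_KEY: str(use_ro_path_filter).lower()}


options = hudi_read_options(use_ro_path_filter=True)

# Hypothetical use with an existing SparkSession `spark` and a placeholder path:
# df = spark.read.format("hudi").options(**options).load("/path/to/table")
```

Leaving the flag unset keeps the existing listing behavior, which matches the default of false described above.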
**Changes:**
- Added new config `FILE_INDEX_LIST_FILE_STATUSES_USING_RO_PATH_FILTER`
(default: false) to enable `HoodieROTablePathFilter` during file listing
- Extended `HoodieTableMetadata`, `BaseTableMetadata`, and
`FileSystemBackedTableMetadata` to support an optional `StoragePathFilter`
parameter
- Added an `FSUtils.getAllDataFilesInPartition` overload that accepts a path
filter option
- Created `HoodieROTableStoragePathFilter` wrapper to adapt Hadoop
PathFilter to Hudi's StoragePathFilter interface
- Updated `BaseHoodieTableFileIndex` to use path filter when enabled,
constructing FileSlices directly from filtered files
- Modified `SparkHoodieTableFileIndex` to wrap and apply
HoodieROTablePathFilter during partition file listing
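The effect of filtering during listing can be sketched in a few lines. This is a simplified illustration, not the PR's implementation. It assumes base file names follow the `<fileId>_<writeToken>_<instantTime>.<ext>` layout and that instant times are fixed-width timestamps that compare correctly as strings. The actual filter may apply further checks (for example against the commit timeline) that are omitted here. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parse_base_file_name(name: str) -> Tuple[str, str]:
    """Split '<fileId>_<writeToken>_<instantTime>.<ext>' into (fileId, instantTime)."""
    stem = name.rsplit(".", 1)[0]
    file_id, _write_token, instant_time = stem.rsplit("_", 2)
    return file_id, instant_time


def latest_base_files(names: Iterable[str]) -> List[str]:
    """Keep only the newest file per file ID, dropping older versions."""
    latest: Dict[str, Tuple[str, str]] = {}
    for name in names:
        file_id, instant = parse_base_file_name(name)
        # Assumption: instant times are fixed-width, so string order is time order.
        if file_id not in latest or instant > latest[file_id][0]:
            latest[file_id] = (instant, name)
    return sorted(name for _instant, name in latest.values())


listing = [
    "f1-0_1-0-1_20240101000000.parquet",
    "f1-0_1-0-1_20240102000000.parquet",
    "f2-0_1-0-1_20240101000000.parquet",
]
kept = latest_base_files(listing)
```

With the listing above, `kept` holds two entries instead of three. The memory held on the driver then scales with the number of file groups, not with the number of file versions.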
### Impact
**Config Changes:**
- New config:
`hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter`
(default: false)
- When enabled, applies HoodieROTablePathFilter during file listing to
filter out older file versions
- Helps prevent OOM issues on the driver for large tables with multiple file
versions
**Performance:**
- Reduces memory pressure on Spark driver for datasets with multiple file
versions per partition
- Enables successful queries on very large tables that previously failed
with OOM errors
### Risk Level
**Low** - The feature is behind a config flag (default: false) and does not
change existing behavior unless explicitly enabled.
**Verification:**
- Existing unit tests pass
- Tested on large production datasets that previously hit OOM errors; queries
now succeed with the filter enabled
### Documentation Update
Config documentation is included in the `withDocumentation` method of the
new config property.
### Contributor's checklist
- [x] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable