suryaprasanna opened a new pull request, #18136:
URL: https://github.com/apache/hudi/pull/18136

   ### Describe the issue this Pull Request addresses
   
   This PR addresses OOM (Out of Memory) issues on the Spark driver when 
querying large Hudi datasets that have multiple versions of files in the same 
partition. When file listing is performed without filtering, all file versions 
are loaded into memory, which can cause the driver to run out of memory.
   
   ### Summary and Changelog
   
   Users can now enable path filtering during file listing to avoid loading 
multiple file versions into memory on the driver. This is controlled by the new 
config 
`hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter`.
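
As a minimal sketch of how a reader might opt in, the snippet below builds the option map for a Hudi read with the new flag turned on. The config key is the one named in this PR; the class `HudiReadOptionsSketch` and the helper `readOptions` are hypothetical names introduced only for illustration, and the Spark call in the comment assumes Spark and Hudi are on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HudiReadOptionsSketch {
    // Config key quoted from this PR description.
    static final String RO_PATH_FILTER_KEY =
        "hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter";

    // Hypothetical helper: collects reader options, toggling the new flag.
    static Map<String, String> readOptions(boolean useRoPathFilter) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put(RO_PATH_FILTER_KEY, Boolean.toString(useRoPathFilter));
        return opts;
    }

    public static void main(String[] args) {
        // With Spark available, the map could be passed to the reader, e.g.
        //   spark.read().format("hudi").options(readOptions(true)).load(basePath);
        // Here we only print the options so the sketch runs without Spark.
        readOptions(true).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Since the default is false, omitting the option leaves the existing listing behavior unchanged.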
   
   **Changes:**
   - Added new config `FILE_INDEX_LIST_FILE_STATUSES_USING_RO_PATH_FILTER` 
(default: false) to enable `HoodieROTablePathFilter` during file listing
   - Extended `HoodieTableMetadata`, `BaseTableMetadata`, and 
`FileSystemBackedTableMetadata` to support optional `StoragePathFilter` 
parameter
   - Added `FSUtils.getAllDataFilesInPartition` overload that accepts path 
filter option
   - Created `HoodieROTableStoragePathFilter` wrapper to adapt Hadoop 
PathFilter to Hudi's StoragePathFilter interface
   - Updated `BaseHoodieTableFileIndex` to use path filter when enabled, 
constructing FileSlices directly from filtered files
   - Modified `SparkHoodieTableFileIndex` to wrap and apply 
HoodieROTablePathFilter during partition file listing
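
To illustrate the idea behind filtering at listing time, here is a simplified, self-contained sketch that keeps only the newest base file per file group. It is not the code in this PR: it assumes base file names of the form `<fileId>_<writeToken>_<instantTime>.<ext>` with fixed-width instant times, and it ignores the commit timeline checks a real filter would need. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LatestBaseFileFilterSketch {
    // Assumption: the file ID is everything before the first underscore.
    static String fileId(String name) {
        return name.substring(0, name.indexOf('_'));
    }

    // Assumption: the instant time sits between the last underscore and the extension.
    static String instantTime(String name) {
        return name.substring(name.lastIndexOf('_') + 1, name.lastIndexOf('.'));
    }

    // Keeps one file name per file ID: the one with the greatest instant time.
    // Fixed-width timestamps compare correctly as plain strings.
    static List<String> keepLatestPerFileGroup(List<String> names) {
        Map<String, String> latest = new HashMap<>();
        for (String n : names) {
            latest.merge(fileId(n), n,
                (a, b) -> instantTime(a).compareTo(instantTime(b)) >= 0 ? a : b);
        }
        // Preserve the input order of the surviving files.
        List<String> result = new ArrayList<>();
        for (String n : names) {
            if (n.equals(latest.get(fileId(n)))) {
                result.add(n);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> listed = List.of(
            "f1-0_0-1-1_20240101000000.parquet",
            "f1-0_0-2-2_20240102000000.parquet",
            "f2-0_0-1-1_20240101000000.parquet");
        keepLatestPerFileGroup(listed).forEach(System.out::println);
    }
}
```

Dropping older versions while listing means the driver holds roughly one entry per file group instead of one per file version, which is where the memory saving described in this PR would come from.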
   
   ### Impact
   
   **Config Changes:**
   - New config: 
`hoodie.datasource.read.file.index.list.file.statuses.using.ro.path.filter` 
(default: false)
     - When enabled, applies HoodieROTablePathFilter during file listing to 
filter out older file versions
     - Helps prevent OOM issues on driver for large tables with multiple file 
versions
   
   **Performance:**
   - Reduces memory pressure on Spark driver for datasets with multiple file 
versions per partition
   - Enables successful queries on very large tables that previously failed 
with OOM errors
   
   ### Risk Level
   
   **Low** - The feature is behind a config flag (default: false) and does not 
change existing behavior unless explicitly enabled.
   
   **Verification:**
   - Existing unit tests pass
   - Tested on large production datasets with OOM issues - queries now succeed 
with filter enabled
   
   ### Documentation Update
   
   Config documentation is included in the `withDocumentation` method of the 
new config property.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable

