prashantwason opened a new pull request, #17775:
URL: https://github.com/apache/hudi/pull/17775

   ### Describe the issue this Pull Request addresses
   
   This PR adds a new configuration option that lists partition paths from 
HMS (Hive Metastore) instead of the file system in the Spark 
`HoodieFileIndex`.
   
   When the MDT (Metadata Table) is not enabled on the reader path, the list of 
partitions is fetched using `FileSystemBackedTableMetadata`, which recursively 
lists all partitions from storage. For large tables with hundreds of 
partitions, this is slow and generates a large number of listing calls.
   
   When the new config is enabled, the partition listing is retrieved from 
HMS and pruned on the driver side, which avoids the recursive storage listing 
done by `FileSystemBackedTableMetadata`.
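
As a rough illustration of driver-side pruning, the sketch below filters a list of partition paths, standing in for what a metastore would return, against a query predicate. It is not the code from this PR: the class and method names are hypothetical, and it assumes Hive-style `key=value` partition paths, which a given table may not use.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch, not Hudi code: prune catalog-listed partitions on the driver.
final class PartitionPruningSketch {

    // Parses a Hive-style relative path such as "dt=2024-01-01/region=us"
    // into ordered column/value pairs (assumes key=value segments).
    static Map<String, String> parsePartitionPath(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Not a Hive-style segment: " + segment);
            }
            values.put(segment.substring(0, eq), segment.substring(eq + 1));
        }
        return values;
    }

    // Keeps only the partition paths whose parsed values satisfy the filter.
    static List<String> prune(List<String> catalogPartitions,
                              Predicate<Map<String, String>> filter) {
        List<String> matching = new ArrayList<>();
        for (String path : catalogPartitions) {
            if (filter.test(parsePartitionPath(path))) {
                matching.add(path);
            }
        }
        return matching;
    }

    public static void main(String[] args) {
        // Stand-in for the partition list a metastore would return.
        List<String> fromCatalog = List.of(
            "dt=2024-01-01/region=us",
            "dt=2024-01-01/region=eu",
            "dt=2024-01-02/region=us");
        System.out.println(prune(fromCatalog, p -> "us".equals(p.get("region"))));
    }
}
```

The point of the sketch is that pruning operates on an in-memory list, so no storage listing calls are needed to decide which partitions a query touches.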
   
   ### Summary and Changelog
   
   **New Config Added:**
   - `hoodie.datasource.read.file.index.list.partitions.from.hms` (default: 
false)
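
A minimal sketch of opting in from the reader side, assuming the option is passed like other Hudi datasource read options. The helper class name is hypothetical; only the config key comes from this PR.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: builds the read options that opt in to HMS-based listing.
final class HmsListingOptionsSketch {

    // Key as given in this PR description; the default is false when unset.
    static final String LIST_PARTITIONS_FROM_HMS =
        "hoodie.datasource.read.file.index.list.partitions.from.hms";

    static Map<String, String> readOptions(boolean listFromHms) {
        Map<String, String> options = new HashMap<>();
        options.put(LIST_PARTITIONS_FROM_HMS, Boolean.toString(listFromHms));
        return options;
    }

    public static void main(String[] args) {
        // With Spark on the classpath, this map would typically be passed as
        // spark.read().format("hudi").options(readOptions(true)).load(basePath),
        // where basePath is a placeholder for the table location.
        System.out.println(readOptions(true));
    }
}
```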
   
   **Changes:**
   - Added new boolean config 
`FILE_INDEX_LIST_PARTITION_PATHS_FROM_HMS_ENABLED` in `DataSourceOptions.scala`
   - Added `getMatchingPartitionPathsFromCatalog` method in 
`BaseHoodieTableFileIndex.java` that can be overridden by implementations
   - Implemented `getMatchingPartitionPathsFromCatalog` in 
`SparkHoodieTableFileIndex.scala` to fetch partitions from Spark's external 
catalog (HMS)
   - Updated `HoodieFileIndex.scala` to pass the new config property
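
The shape of these changes can be sketched as an overridable hook on a base class, selected by a boolean flag. Everything below is an assumption made for illustration: the nested class names, the no-argument signature, and the default behavior of throwing are not taken from the PR, and only the method name `getMatchingPartitionPathsFromCatalog` matches it.

```java
import java.util.List;

// Hypothetical sketch of an overridable catalog hook; not the Hudi classes themselves.
final class CatalogHookSketch {

    abstract static class BaseIndex {
        private final boolean listFromCatalog;

        BaseIndex(boolean listFromCatalog) {
            this.listFromCatalog = listFromCatalog;
        }

        // Assumed default: an engine without a catalog rejects the call.
        protected List<String> getMatchingPartitionPathsFromCatalog() {
            throw new UnsupportedOperationException("catalog listing not supported");
        }

        protected abstract List<String> listPartitionPathsFromStorage();

        // Chooses the listing source based on the opt-in flag.
        final List<String> listPartitionPaths() {
            return listFromCatalog
                ? getMatchingPartitionPathsFromCatalog()
                : listPartitionPathsFromStorage();
        }
    }

    // Stand-in for an engine-specific subclass that has a catalog available.
    static final class CatalogBacked extends BaseIndex {
        private final List<String> catalogPartitions;
        private final List<String> storagePartitions;

        CatalogBacked(boolean listFromCatalog,
                      List<String> catalogPartitions,
                      List<String> storagePartitions) {
            super(listFromCatalog);
            this.catalogPartitions = catalogPartitions;
            this.storagePartitions = storagePartitions;
        }

        @Override
        protected List<String> getMatchingPartitionPathsFromCatalog() {
            return catalogPartitions;
        }

        @Override
        protected List<String> listPartitionPathsFromStorage() {
            return storagePartitions;
        }
    }

    public static void main(String[] args) {
        List<String> catalog = List.of("dt=2024-01-01");
        List<String> storage = List.of("dt=2024-01-01", "dt=2024-01-02");
        System.out.println(new CatalogBacked(true, catalog, storage).listPartitionPaths());
        System.out.println(new CatalogBacked(false, catalog, storage).listPartitionPaths());
    }
}
```

Keeping the hook on the base class means engines without a catalog are unaffected, while the flag keeps the existing storage listing as the default path.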
   
   ### Impact
   
   - New user-facing configuration option for Spark-based queries
   - No breaking changes to existing APIs
   - When enabled, the partition listing is obtained from HMS instead of the 
file system
   
   ### Risk Level
   
   low - This is an opt-in feature with a default value of `false`. The 
existing behavior remains unchanged unless the config is explicitly enabled.
   
   ### Documentation Update
   
   The config carries its own documentation string:
   - `hoodie.datasource.read.file.index.list.partitions.from.hms`: Controls 
whether the partition listing is obtained from HMS instead of listing the file 
system, to avoid recursively listing a large number of directories. Enabling 
this can reduce the number of listing calls and speed up queries on very 
large tables. This is only necessary when the MDT is not enabled on the 
dataset; otherwise the MDT can provide the partition listing faster and 
without any actual listing on the file system.
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [ ] Adequate tests were added if applicable

