prashantwason opened a new pull request, #17775: URL: https://github.com/apache/hudi/pull/17775
### Describe the issue this Pull Request addresses

This PR adds a new configuration option to list partition paths from HMS (Hive Metastore) instead of the file system in the Spark `HoodieFileIndex`.

When MDT (Metadata Table) is not enabled on the reader path, the list of partitions is fetched using `FileSystemBackedTableMetadata`, which recursively lists all partitions from storage. This can be slow for large tables with hundreds of partitions and generates a large number of listing calls.

When the new config is enabled, the partition listing is retrieved from HMS and pruned on the driver side, which performs better than `FileSystemBackedTableMetadata`.

### Summary and Changelog

**New Config Added:**
- `hoodie.datasource.read.file.index.list.partitions.from.hms` (default: false)

**Changes:**
- Added new boolean config `FILE_INDEX_LIST_PARTITION_PATHS_FROM_HMS_ENABLED` in `DataSourceOptions.scala`
- Added `getMatchingPartitionPathsFromCatalog` method in `BaseHoodieTableFileIndex.java` that can be overridden by implementations
- Implemented `getMatchingPartitionPathsFromCatalog` in `SparkHoodieTableFileIndex.scala` to fetch partitions from Spark's external catalog (HMS)
- Updated `HoodieFileIndex.scala` to pass the new config property

### Impact

- New user-facing configuration option for Spark-based queries
- No breaking changes to existing APIs
- When enabled, partition listing is obtained from HMS instead of the file system

### Risk Level

Low. This is an opt-in feature with a default value of false. The existing behavior remains unchanged unless the config is explicitly enabled.

### Documentation Update

The config is self-documenting through its documentation string:
- `hoodie.datasource.read.file.index.list.partitions.from.hms`: Controls whether partition listing is obtained from the HMS instead of listing the file system, to avoid recursively listing a large number of directories. Enabling this can reduce the number of listing calls and speed up queries for very large tables. This is only necessary when MDT is not enabled on the dataset, as otherwise the MDT can provide the partition listing faster and without any actual listing on the file system.

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [ ] Adequate tests were added if applicable

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
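
For readers unfamiliar with the approach, the driver-side pruning described in this PR can be illustrated with a minimal, self-contained sketch. This is not the code from the PR: the class `CatalogPartitionPruningSketch` and its methods are hypothetical names, the catalog is modeled as an in-memory list of partition specs (column name to value), and hive-style `col=value` partition paths are assumed. The point is only that partitions come from a catalog listing and are filtered in memory, with no recursive storage listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

/**
 * Hypothetical sketch, not Hudi code: prune catalog-provided partition specs
 * on the driver instead of recursively listing storage.
 */
class CatalogPartitionPruningSketch {

  /** Builds a hive-style relative path such as "region=us/dt=2024-01-01" (assumed layout). */
  static String toRelativePath(List<String> partitionColumns, Map<String, String> spec) {
    return partitionColumns.stream()
        .map(col -> col + "=" + spec.get(col))
        .collect(Collectors.joining("/"));
  }

  /**
   * Keeps only the partitions whose spec satisfies the filter and returns
   * their relative paths, in the order the catalog returned them.
   */
  static List<String> matchingPartitionPaths(
      List<Map<String, String>> catalogPartitions,
      List<String> partitionColumns,
      Predicate<Map<String, String>> partitionFilter) {
    List<String> matching = new ArrayList<>();
    for (Map<String, String> spec : catalogPartitions) {
      if (partitionFilter.test(spec)) {
        matching.add(toRelativePath(partitionColumns, spec));
      }
    }
    return matching;
  }

  public static void main(String[] args) {
    List<String> columns = List.of("region", "dt");
    // Stand-in for the partition specs a catalog such as HMS would return.
    List<Map<String, String>> fromCatalog = List.of(
        Map.of("region", "us", "dt", "2024-01-01"),
        Map.of("region", "eu", "dt", "2024-01-01"),
        Map.of("region", "us", "dt", "2024-01-02"));

    List<String> pruned = matchingPartitionPaths(
        fromCatalog, columns, spec -> "us".equals(spec.get("region")));

    // prints [region=us/dt=2024-01-01, region=us/dt=2024-01-02]
    System.out.println(pruned);
  }
}
```

In this sketch the cost scales with the number of partitions registered in the catalog rather than with the number of directories on storage, which is the trade-off the new config targets when MDT is not enabled.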
