Re: [PR] feat(spark): add config to list partition paths from HMS in file index [hudi]

via GitHub Thu, 29 Jan 2026 12:53:22 -0800


nsivabalan commented on code in PR #17775:
URL: https://github.com/apache/hudi/pull/17775#discussion_r2743470256



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DataSourceOptions.scala:
##########
@@ -216,6 +216,16 @@ object DataSourceReadOptions {
         " by carefully analyzing provided partition-column predicates and 
deducing corresponding partition-path prefix from " +
         " them (if possible).")
 
+  val FILE_INDEX_LIST_PARTITION_PATHS_FROM_HMS_ENABLED: 
ConfigProperty[Boolean] =
+    
ConfigProperty.key("hoodie.datasource.read.file.index.list.partitions.from.hms")

Review Comment:
   how about 
   `hoodie.datasource.read.file.index.list.partitions.from.catalog` 



##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -353,6 +361,27 @@ private HoodieTimeline findInstantsInRange() {
     }
   }
 
+  /**
+   * List partition paths matching path prefixes from the Catalog.
+   *
+   * File Index implementations can override this method to fetch partition 
paths from the Catalog. This may be faster
+   * than listing all partition paths from the table metadata and filtering 
them. This is definitely faster than
+   * listing all partition paths from the file system when metadata table may 
not be enabled.
+   *
+   * Fetches all partition paths that are the sub-directories of the list of 
provided (relative) paths.
+   * <p>
+   * E.g., Table has partition 4 partitions:
+   * year=2022/month=08/day=30, year=2022/month=08/day=31, 
year=2022/month=07/day=03, year=2022/month=07/day=04
+   * The relative path "year=2022" returns all partitions, while the relative 
path
+   * "year=2022/month=07" returns only two partitions.
+   *
+   * @param relativePathPrefixes The prefixes to relative partition paths that 
must match
+   * @return null if not supported by File Index implementation
+   */
+  protected List<String> getMatchingPartitionPathsFromCatalog(List<String> 
relativePathPrefixes) {

Review Comment:
   can we also add a boolean method. Not a fan of relying on null return 
values. 
   
   then we cold do something like 
   
   ```
   boolean isPartitionListingViaCatalogEnabled = 
isPartitionListingViaCatalogEnabled()
   if (isPartitionListingViaCatalogEnabled) {
     matchedPartitionPaths = getMatchingPartitionPathsFromCatalog(...)
   } else. {
     matchedPartitionPaths = 
tableMetadata.getPartitionPathWithPathPrefixUsingFilterExpression(relativePartitionPaths,
               partitionFields, partitionColumnPredicates);
   }
   
   ```
   
   



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -527,6 +527,12 @@ object HoodieFileIndex extends Logging {
       
properties.setProperty(DataSourceReadOptions.FILE_INDEX_LISTING_MODE_OVERRIDE.key,
 listingModeOverride)
     }
 
+    var hmsListingEnabled = getConfigValue(options, sqlConf,

Review Comment:
   how about `partitionListingViaCatalog` 



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -437,6 +439,56 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
 
   private def arePartitionPathsUrlEncoded: Boolean =
     metaClient.getTableConfig.getUrlEncodePartitioning.toBoolean
+  /**
+   * List partition paths matching path prefixes from the Catalog.
+   *
+   * File Index implementations can override this method to fetch partition 
paths from the Catalog. This may be faster
+   * than listing all partition paths from the table metadata and filtering 
them. This is definitely faster than
+   * listing all partition paths from the file system when metadata table may 
not be enabled.
+   *
+   * Fetches all partition paths that are the sub-directories of the list of 
provided (relative) paths.
+   * <p>
+   * E.g., Table has partition 4 partitions:
+   * year=2022/month=08/day=30, year=2022/month=08/day=31, 
year=2022/month=07/day=03, year=2022/month=07/day=04
+   * The relative path "year=2022" returns all partitions, while the relative 
path
+   * "year=2022/month=07" returns only two partitions.
+   *
+   * @param relativePathPrefixes The prefixes to relative partition paths that 
must match
+   * @return null if not supported by File Index implementation
+   */
+  override protected def 
getMatchingPartitionPathsFromCatalog(relativePathPrefixes: List[String]): 
List[String] = {
+    // If listing from the catalog is disabled, or if MDT is available (which 
is faster), return null
+    if 
(!configProperties.getBoolean(FILE_INDEX_LIST_PARTITION_PATHS_FROM_HMS_ENABLED.key,
 FILE_INDEX_LIST_PARTITION_PATHS_FROM_HMS_ENABLED.defaultValue())
+        || metaClient.getTableConfig.isMetadataTableAvailable) {
+      null
+    } else {
+      // Retrieve all the partition paths from the catalog
+      logInfo("Listing partition paths from the catalog using path prefixes " 
+ relativePathPrefixes.toString)
+      val databaseName = metaClient.getTableConfig.getDatabaseName
+      val tableName = metaClient.getTableConfig.getTableName
+      val basePath = metaClient.getBasePath
+      val allPartitionPaths: Seq[String] = 
spark.sessionState.catalog.externalCatalog

Review Comment:
   we can branch this one out 
   ```
   if (relativePathPrefixes is not empty) {
   .. filter and return. 
   //    here lets name the var as filteredPartitionPaths instead of 
allPartitionPaths
   } else {
     return all partitons.
   
   }
   
   
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat(spark): add config to list partition paths from HMS in file index [hudi]

Reply via email to