prashantwason commented on code in PR #17775:
URL: https://github.com/apache/hudi/pull/17775#discussion_r2770789667
##########
hudi-common/src/main/java/org/apache/hudi/BaseHoodieTableFileIndex.java:
##########
@@ -353,6 +361,27 @@ private HoodieTimeline findInstantsInRange() {
}
}
+  /**
+   * List partition paths matching path prefixes from the Catalog.
+   *
+   * File Index implementations can override this method to fetch partition paths from the Catalog. This may be
+   * faster than listing all partition paths from the table metadata and filtering them, and is definitely faster
+   * than listing all partition paths from the file system when the metadata table is not enabled.
+   *
+   * Fetches all partition paths that are sub-directories of the provided (relative) paths.
+   * <p>
+   * E.g., a table has 4 partitions:
+   * year=2022/month=08/day=30, year=2022/month=08/day=31, year=2022/month=07/day=03, year=2022/month=07/day=04
+   * The relative path "year=2022" returns all four partitions, while "year=2022/month=07" returns only two.
+   *
+   * @param relativePathPrefixes The prefixes that relative partition paths must match
+   * @return null if not supported by the File Index implementation
+   */
+  protected List<String> getMatchingPartitionPathsFromCatalog(List<String> relativePathPrefixes) {
Review Comment:
Review Comment:
Done. Added an `isPartitionListingViaCatalogEnabled()` boolean method in both
the base class and `SparkHoodieTableFileIndex`. The code now checks this method
instead of testing for null return values.
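The capability-flag pattern described above can be sketched as follows. This is a minimal illustration, not the actual Hudi code: the class names `BaseFileIndexSketch` and `CatalogBackedIndexSketch` and the `listPartitions` helper are hypothetical stand-ins.

```java
import java.util.Collections;
import java.util.List;

// Sketch: callers consult an explicit boolean capability flag instead of
// probing for a null return from getMatchingPartitionPathsFromCatalog.
abstract class BaseFileIndexSketch {
  // Base-class default: catalog-backed partition listing is not supported.
  protected boolean isPartitionListingViaCatalogEnabled() {
    return false;
  }

  // Only meaningful when isPartitionListingViaCatalogEnabled() returns true.
  protected List<String> getMatchingPartitionPathsFromCatalog(List<String> relativePathPrefixes) {
    throw new UnsupportedOperationException("catalog listing not supported");
  }

  public List<String> listPartitions(List<String> relativePathPrefixes) {
    // The flag replaces the old "null means unsupported" convention.
    if (isPartitionListingViaCatalogEnabled()) {
      return getMatchingPartitionPathsFromCatalog(relativePathPrefixes);
    }
    return Collections.emptyList(); // fall back to metadata/file-system listing (elided)
  }
}

// A Spark-like subclass that opts in, as SparkHoodieTableFileIndex does.
class CatalogBackedIndexSketch extends BaseFileIndexSketch {
  @Override
  protected boolean isPartitionListingViaCatalogEnabled() {
    return true;
  }

  @Override
  protected List<String> getMatchingPartitionPathsFromCatalog(List<String> relativePathPrefixes) {
    // Hard-coded result for illustration only.
    return List.of("year=2022/month=07/day=03", "year=2022/month=07/day=04");
  }
}
```

The advantage over a null sentinel is that callers can decide up front whether the catalog path applies, without invoking the method and special-casing its result.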
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -437,6 +439,56 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
private def arePartitionPathsUrlEncoded: Boolean =
metaClient.getTableConfig.getUrlEncodePartitioning.toBoolean
+  /**
+   * List partition paths matching path prefixes from the Catalog.
+   *
+   * File Index implementations can override this method to fetch partition paths from the Catalog. This may be
+   * faster than listing all partition paths from the table metadata and filtering them, and is definitely faster
+   * than listing all partition paths from the file system when the metadata table is not enabled.
+   *
+   * Fetches all partition paths that are sub-directories of the provided (relative) paths.
+   * <p>
+   * E.g., a table has 4 partitions:
+   * year=2022/month=08/day=30, year=2022/month=08/day=31, year=2022/month=07/day=03, year=2022/month=07/day=04
+   * The relative path "year=2022" returns all four partitions, while "year=2022/month=07" returns only two.
+   *
+   * @param relativePathPrefixes The prefixes that relative partition paths must match
+   * @return null if not supported by the File Index implementation
+   */
+  override protected def getMatchingPartitionPathsFromCatalog(relativePathPrefixes: List[String]): List[String] = {
+    // If listing from the catalog is disabled, or if the MDT is available (which is faster), return null
+    if (!configProperties.getBoolean(FILE_INDEX_LIST_PARTITION_PATHS_FROM_HMS_ENABLED.key, FILE_INDEX_LIST_PARTITION_PATHS_FROM_HMS_ENABLED.defaultValue())
+        || metaClient.getTableConfig.isMetadataTableAvailable) {
+      null
+    } else {
+      // Retrieve all the partition paths from the catalog
+      logInfo("Listing partition paths from the catalog using path prefixes " + relativePathPrefixes.toString)
+      val databaseName = metaClient.getTableConfig.getDatabaseName
+      val tableName = metaClient.getTableConfig.getTableName
+      val basePath = metaClient.getBasePath
+      val allPartitionPaths: Seq[String] = spark.sessionState.catalog.externalCatalog
Review Comment:
Review Comment:
Done. Refactored the branching logic to separate the case where
`relativePathPrefixes` is empty (return all partitions) from the case where
filtering is needed. Also renamed the variable to `filteredPartitionPaths` as
suggested.
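The refactored branching can be sketched as below, using the 4-partition example from the Javadoc. This is an illustrative standalone helper, not Hudi's code: `PartitionPrefixFilterSketch` and `filterByPrefixes` are hypothetical names.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the two-branch logic: an empty prefix list returns all
// partitions unchanged; otherwise partitions are filtered by prefix match.
class PartitionPrefixFilterSketch {
  static List<String> filterByPrefixes(List<String> allPartitionPaths,
                                       List<String> relativePathPrefixes) {
    if (relativePathPrefixes.isEmpty()) {
      // No prefixes given: every partition matches, so skip filtering.
      return allPartitionPaths;
    }
    // Keep only paths that start with at least one of the given prefixes.
    return allPartitionPaths.stream()
        .filter(path -> relativePathPrefixes.stream().anyMatch(path::startsWith))
        .collect(Collectors.toList());
  }
}
```

With the example partitions, an empty prefix list yields all 4 paths, the prefix `"year=2022/month=07"` yields 2, and `"year=2022"` again yields all 4, matching the behavior described in the Javadoc.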
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]