xushiyan commented on code in PR #6680:
URL: https://github.com/apache/hudi/pull/6680#discussion_r1022548772
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala:
##########
@@ -237,70 +246,64 @@ class SparkHoodieTableFileIndex(spark: SparkSession,
}
}
- private def listMatchingPartitionPathsInternal(partitionColumnNames: Seq[String],
- partitionColumnPredicates: Seq[Expression]): Seq[PartitionPath] = {
- // NOTE: Here we try to to achieve efficiency in avoiding necessity to recursively list deep folder structures of
- // partitioned tables w/ multiple partition columns, by carefully analyzing provided partition predicates:
- //
- // In cases when partition-predicates have
- // - The form of equality predicates w/ static literals (for ex, like `date = '2022-01-01'`)
- // - Fully specified proper prefix of the partition schema (ie fully binding first N columns
- // of the partition schema adhering to hereby described rules)
- //
- // We will try to exploit this specific structure, and try to reduce the scope of a
- // necessary file-listings of partitions of the table to just the sub-folder under relative prefix
- // of the partition-path derived from the partition-column predicates. For ex, consider following
- // scenario:
- //
- // Table's partition schema (in-order):
- //
- // country_code: string (for ex, 'us')
- // date: string (for ex, '2022-01-01')
- //
- // Table's folder structure:
- // us/
- // |- 2022-01-01/
- // |- 2022-01-02/
- // ...
- //
- // In case we have incoming query specifies following predicates:
- //
- // `... WHERE country_code = 'us' AND date = '2022-01-01'`
- //
- // We can deduce full partition-path w/o doing a single listing: `us/2022-01-01`
- if (areAllPartitionPathsCached || !shouldUsePartitionPathPrefixAnalysis(configProperties)) {
- logDebug("All partition paths have already been cached, use it directly")
+ // NOTE: Here we try to to achieve efficiency in avoiding necessity to recursively list deep folder structures of
+ // partitioned tables w/ multiple partition columns, by carefully analyzing provided partition predicates:
+ //
+ // In cases when partition-predicates have
+ // - The form of equality predicates w/ static literals (for ex, like `date = '2022-01-01'`)
+ // - Fully specified proper prefix of the partition schema (ie fully binding first N columns
+ // of the partition schema adhering to hereby described rules)
+ //
+ // We will try to exploit this specific structure, and try to reduce the scope of a
+ // necessary file-listings of partitions of the table to just the sub-folder under relative prefix
+ // of the partition-path derived from the partition-column predicates. For ex, consider following
+ // scenario:
+ //
+ // Table's partition schema (in-order):
+ //
+ // country_code: string (for ex, 'us')
+ // date: string (for ex, '2022-01-01')
+ //
+ // Table's folder structure:
+ // us/
+ // |- 2022-01-01/
+ // |- 2022-01-02/
+ // ...
+ //
+ // In case we have incoming query specifies following predicates:
+ //
+ // `... WHERE country_code = 'us' AND date = '2022-01-01'`
+ //
+ // We can deduce full partition-path w/o doing a single listing: `us/2022-01-01`
Review Comment:
correct
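
For readers following the thread, the prefix analysis described in the NOTE comment can be illustrated with a minimal sketch. This is not the code in this PR: `PartitionPrefixSketch` and `boundPathPrefix` are hypothetical names, the predicates are assumed to be already reduced to a plain column-to-literal map, and partition folders are assumed to be named by value only (as in the `us/2022-01-01` example), not in `column=value` style.

```scala
object PartitionPrefixSketch {
  // Hypothetical helper, not part of Hudi's API.
  // partitionColumns: partition schema column names, in order.
  // equalityPredicates: column -> literal, taken from `col = 'literal'` predicates.
  // Returns the relative path bound by the leading columns, and whether every
  // partition column is bound (so no listing would be needed at all).
  def boundPathPrefix(partitionColumns: Seq[String],
                      equalityPredicates: Map[String, String]): (String, Boolean) = {
    val boundValues = partitionColumns
      .map(equalityPredicates.get)   // look up each column's literal, in schema order
      .takeWhile(_.isDefined)        // stop at the first column without an equality predicate
      .flatten
    (boundValues.mkString("/"), boundValues.length == partitionColumns.length)
  }
}

object PartitionPrefixSketchDemo extends App {
  val schema = Seq("country_code", "date")

  // Both columns bound: the full partition path is known without listing.
  println(PartitionPrefixSketch.boundPathPrefix(
    schema, Map("country_code" -> "us", "date" -> "2022-01-01")))

  // Only the first column bound: listing can be limited to the `us` sub-folder.
  println(PartitionPrefixSketch.boundPathPrefix(schema, Map("country_code" -> "us")))

  // Only the second column bound: no proper prefix, so the prefix is empty.
  println(PartitionPrefixSketch.boundPathPrefix(schema, Map("date" -> "2022-01-01")))
}
```

The `takeWhile` step is what enforces the "proper prefix" rule from the comment: a predicate on `date` alone does not narrow the listing, because `country_code` comes first in the partition schema.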