nsivabalan commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r846676628
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
* @return list of pruned (data-skipped) candidate base-files' names
*/
private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
- if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
- .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+ // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+ // the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+ // - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+ // CSI only contains stats for top-level columns, in this case for "struct")
+ // - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+ // nothing CSI in particular could be applied for)
+ lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+ if (!isMetadataTableEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {
Review Comment:
Also, how do we deduce which columns have been indexed in the MDT CSI?
For example, we have two flows:
a. hoodie.metadata.index.column.stats.all_columns.enable = true, in which case all columns are indexed.
b. hoodie.metadata.index.column.stats.column.list set to the list of columns to be indexed.
So, when we apply data skipping on the query side, should we check these configs to decide whether a particular column is indexed by CSI?
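
To make the question concrete, here is a rough sketch of the kind of query-side check I have in mind. It is not Hudi's actual API: `CsiColumnCheck`, `indexedColumns` and `indexedReferencedColumns` are hypothetical names, the configs are read from a plain `Map` rather than from the real table or writer config, and treating "neither config set" as "no columns indexed" is an assumption, not the documented default.

```scala
// Hypothetical sketch only; names below are illustrative and not part of Hudi.
object CsiColumnCheck {
  // The two config keys mentioned in the comment above.
  val AllColumnsKey = "hoodie.metadata.index.column.stats.all_columns.enable"
  val ColumnListKey = "hoodie.metadata.index.column.stats.column.list"

  // None        => every column is indexed (flow a)
  // Some(names) => only the listed columns are indexed (flow b);
  //                an empty set if no list is configured (assumption)
  def indexedColumns(props: Map[String, String]): Option[Set[String]] = {
    val allEnabled = props.get(AllColumnsKey).exists(_.trim.equalsIgnoreCase("true"))
    if (allEnabled) None
    else Some(
      props.getOrElse(ColumnListKey, "")
        .split(",")
        .map(_.trim)
        .filter(_.nonEmpty)
        .toSet)
  }

  // Keeps only the query-referenced columns that CSI would have stats for.
  def indexedReferencedColumns(referenced: Set[String],
                               props: Map[String, String]): Set[String] =
    indexedColumns(props) match {
      case None          => referenced
      case Some(indexed) => referenced.intersect(indexed)
    }
}
```

With something like this, data skipping could fall back to a full listing when the returned set is empty, instead of consulting CSI for columns it has no stats for.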