yihua commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r845618339


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). Following cases could not be effectively handled by Data Skipping:
+    //          - Expressions on top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   Yes.  Per discussion, `hoodie.metadata.enable` is still needed so that the
correct API is used to fetch column stats, which prevents any exception.
`hoodie.metadata.index.column.stats.enable` might not be needed.  We need to
revisit the abstraction and configs for reading the metadata table as a whole in
a separate effort.
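
   For reference, a minimal standalone sketch of the guard in the hunk above.
It only models the four boolean checks from the diff; `DataSkippingGuard`,
`Flags`, and `canUseColumnStats` are hypothetical names for illustration, not
Hudi APIs, and the mapping of each flag to a config key or table state in the
comments is an assumption based on this thread.

```scala
// Illustrative model only: these names are hypothetical and not part of Hudi.
object DataSkippingGuard {

  final case class Flags(
    metadataTableEnabled: Boolean,      // assumed to follow hoodie.metadata.enable
    columnStatsIndexEnabled: Boolean,   // assumed to follow hoodie.metadata.index.column.stats.enable
    columnStatsIndexAvailable: Boolean, // assumed: column stats is among the completed metadata partitions
    dataSkippingEnabled: Boolean        // data skipping requested by the reader
  )

  // Mirrors the negated condition in the diff: the lookup is skipped
  // as soon as any one of the four flags is false.
  def canUseColumnStats(f: Flags): Boolean =
    f.metadataTableEnabled &&
      f.columnStatsIndexEnabled &&
      f.columnStatsIndexAvailable &&
      f.dataSkippingEnabled
}
```

   If `hoodie.metadata.index.column.stats.enable` turns out not to be needed
on the read path, the second flag would simply drop out of this check.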



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)

Review Comment:
   Synced up offline.  The concern is resolved.  See the comment below.


