yihua commented on code in PR #5244:
URL: https://github.com/apache/hudi/pull/5244#discussion_r844550905


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -48,7 +48,8 @@
       .sinceVersion("0.7.0")
       .withDocumentation("Enable the internal metadata table which serves table metadata like level file listings");
 
-  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = false;
+  // TODO rectify
+  public static final boolean DEFAULT_METADATA_ENABLE_FOR_READERS = true;

Review Comment:
   A reminder here to remove this change before merging.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)
-      .contains(HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS)) {
+    // NOTE: Data Skipping is only effective when it references columns that are indexed w/in
+    //       the Column Stats Index (CSI). The following cases cannot be effectively handled by Data Skipping:
+    //          - Expressions on a top-level column's fields (ie, for ex filters like "struct.field > 0", since
+    //          CSI only contains stats for top-level columns, in this case for "struct")
+    //          - Any expression not directly referencing a top-level column (for ex, sub-queries, since there's
+    //          nothing CSI in particular could be applied for)
+    lazy val queryReferencedColumns = collectReferencedColumns(spark, queryFilters, schema)
+
+    if (!isMetadataTableEnabled || !isColumnStatsIndexEnabled || !isColumnStatsIndexAvailable || !isDataSkippingEnabled) {

Review Comment:
   The check here should not rely on `isMetadataTableEnabled` (`hoodie.metadata.enable`) or `isColumnStatsIndexEnabled` (`hoodie.metadata.index.column.stats.enable`), which may not be the source of truth on the query side.  `isColumnStatsIndexAvailable` should be the only source of truth for whether the col_stats partition is ready to be read in the metadata table.
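   For illustration, the suggested guard could be sketched roughly as below. This is a hypothetical standalone sketch, not the actual PR code: the function name, the `String`-typed filters, and the boolean parameters are illustrative stand-ins, with only `isDataSkippingEnabled` and `isColumnStatsIndexAvailable` mirroring flags from the diff.

   ```scala
   // Hypothetical sketch: on the query side, data skipping is attempted only when
   // it is enabled, there are filters to prune by, and the col_stats partition of
   // the metadata table is actually available to read. Writer-side configs such as
   // hoodie.metadata.enable are deliberately not consulted here.
   def shouldAttemptDataSkipping(isDataSkippingEnabled: Boolean,
                                 isColumnStatsIndexAvailable: Boolean,
                                 queryFilters: Seq[String]): Boolean =
     isDataSkippingEnabled && queryFilters.nonEmpty && isColumnStatsIndexAvailable
   ```

   Under this shape, toggling writer-side index configs off after the index has been built would not silently disable pruning for readers.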



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##########
@@ -196,12 +191,20 @@ case class HoodieFileIndex(spark: SparkSession,
    * @return list of pruned (data-skipped) candidate base-files' names
    */
   private def lookupCandidateFilesInMetadataTable(queryFilters: Seq[Expression]): Try[Option[Set[String]]] = Try {
-    if (!isDataSkippingEnabled || queryFilters.isEmpty || !HoodieTableMetadataUtil.getCompletedMetadataPartitions(metaClient.getTableConfig)

Review Comment:
   The existing logic looks fine to me.  What's the gap here?


