codope commented on code in PR #12493:
URL: https://github.com/apache/hudi/pull/12493#discussion_r1888650993


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java:
##########
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, 
String instantTime, HoodieC
       }
     }
   }
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {

Review Comment:
   Snce this gets called after writing the completed instant, the table config 
`hoodie.table.metadata.partitions` must have been updated. So, instead of using 
the metadata config, can we use 
`metaClient().getTableConfig().getMetadataPartitions().contains(MetadataPartitionType.COLUMN_STATS.getPartitionPath())`
 ?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java:
##########
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, 
String instantTime, HoodieC
       }
     }
   }
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
+      try {
+        HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+            .setStorage(dataTable.getStorage())
+            
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
+            .build();
+        HoodieInstant latestInstant = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,b)
 -> b).get();
+
+        final HoodieCommitMetadata mdtCommitMetadata = 
mdtMetaClient.getTimelineLayout().getCommitMetadataSerDe().deserialize(
+            latestInstant,
+            
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(),
+            HoodieCommitMetadata.class);
+        if 
(mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath()))
 {

Review Comment:
   Another issue is that the check is costly. If not for this check, then we 
don't need `mdtCommitMetadata` and hence don't need to create `mdtMetaClient`. 
If we strictly need this check, please consider caching the meta client or 
reusing an existing instance if possible.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java:
##########
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, 
String instantTime, HoodieC
       }
     }
   }
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
+      try {
+        HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+            .setStorage(dataTable.getStorage())
+            
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
+            .build();
+        HoodieInstant latestInstant = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,b)
 -> b).get();

Review Comment:
   why call `reduce`? Wouldn't 
`mdtMetaClient.getActiveTimeline().filterCompletedInstants().lastInstant()` be 
sufficient?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java:
##########
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, 
String instantTime, HoodieC
       }
     }
   }
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
+      try {
+        HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+            .setStorage(dataTable.getStorage())
+            
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
+            .build();
+        HoodieInstant latestInstant = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,b)
 -> b).get();
+
+        final HoodieCommitMetadata mdtCommitMetadata = 
mdtMetaClient.getTimelineLayout().getCommitMetadataSerDe().deserialize(
+            latestInstant,
+            
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(),
+            HoodieCommitMetadata.class);
+        if 
(mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath()))
 {
+          // update data table's table config for list of columns indexed.
+          List<String> columnsToIndex = 
HoodieTableMetadataUtil.getColumnsToIndex(commitMetadata, 
dataTable.getMetaClient(), config.getMetadataConfig());
+          // if col stats is getting updated, lets also update list of columns 
indexed if changed.
+          if 
(!dataTable.getMetaClient().getTableConfig().getTableColStatsIndexedColumns().equals(columnsToIndex))
 {

Review Comment:
   how are nested fields getting tracked? Is it like `a.b`? Asking bcoz if a 
nested field has same name as any other non-nested field, then `equals` check 
could return unexpected result.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java:
##########
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, 
String instantTime, HoodieC
       }
     }
   }
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
+      try {
+        HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+            .setStorage(dataTable.getStorage())
+            
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
+            .build();
+        HoodieInstant latestInstant = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,b)
 -> b).get();
+
+        final HoodieCommitMetadata mdtCommitMetadata = 
mdtMetaClient.getTimelineLayout().getCommitMetadataSerDe().deserialize(
+            latestInstant,
+            
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(),
+            HoodieCommitMetadata.class);
+        if 
(mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath()))
 {

Review Comment:
   why is this check needed? I think the check on line 309 whether columns 
indexed has changed or not is sufficient.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieClient.java:
##########
@@ -284,4 +288,33 @@ protected void writeTableMetadata(HoodieTable table, 
String instantTime, HoodieC
       }
     }
   }
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata) {

Review Comment:
   Though it is safe to update in multi-writer scenario as this is still called 
within a transaction, I am thinking should we do all table config updates in 
one shot instead opening and closing the file multiple times? Let's say if this 
was the first commit building the colstats, then table config would be update 
multiple times, isn't it?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -477,7 +482,14 @@ private void initializeFromFilesystem(String 
initializationTime, List<MetadataPa
       HoodieData<HoodieRecord> records = 
fileGroupCountAndRecordsPair.getValue();
       bulkCommit(instantTimeForPartition, partitionName, records, 
fileGroupCount);
       metadataMetaClient.reloadActiveTimeline();
-
+      if (partitionType == COLUMN_STATS) {

Review Comment:
   let's use `MetadataPartitionType.COLUMN_STATS` enum as much as possible.



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1364,24 +1342,47 @@ public static HoodieData<HoodieRecord> 
convertMetadataToColumnStatsRecords(Hoodi
     }
   }
 
+  public static List<String> getColumnsToIndex(HoodieCommitMetadata 
commitMetadata,

Review Comment:
   Though this is getting exercised in the functional tests in 
`TestColumnStatsIndex`, it would be nice to make it visible for testing and add 
a UT for this method in `TestHoodieTableMetadataUtil` if possible.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -336,6 +338,12 @@ public class HoodieTableConfig extends HoodieConfig {
       .withDocumentation("Comma-separated list of metadata partitions that 
have been completely built and in-sync with data table. "
           + "These partitions are ready for use by the readers");
 
+  public static final ConfigProperty<String> TABLE_COL_STATS_INDEXED_COLUMNS = 
ConfigProperty
+      .key("hoodie.table.col.stats.indexed.columns")

Review Comment:
   ```suggestion
         .key("hoodie.table.metadata.column.stats.indexed.columns")
   ```



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -477,7 +482,14 @@ private void initializeFromFilesystem(String 
initializationTime, List<MetadataPa
       HoodieData<HoodieRecord> records = 
fileGroupCountAndRecordsPair.getValue();
       bulkCommit(instantTimeForPartition, partitionName, records, 
fileGroupCount);
       metadataMetaClient.reloadActiveTimeline();
-
+      if (partitionType == COLUMN_STATS) {
+        // if col stats, lets also update list of columns indexed if changed.
+        if 
(!dataMetaClient.getTableConfig().getTableColStatsIndexedColumns().equals(columnsToIndex))
 {
+          LOG.info(String.format("List of columns to index is changing. Old 
value %s. New value %s", 
dataMetaClient.getTableConfig().getTableColStatsIndexedColumns(),
+              columnsToIndex));
+          
dataMetaClient.getTableConfig().setColStatsIndexedColumns(dataMetaClient, 
columnsToIndex);

Review Comment:
   If we're anyway going to check and update post commit, why doing it here? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to