codope commented on code in PR #12529:
URL: https://github.com/apache/hudi/pull/12529#discussion_r1893860300


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieIndexClientUtils.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Functions;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+import org.apache.hudi.metadata.HoodieTableMetadataUtil;
+import org.apache.hudi.metadata.MetadataPartitionType;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.List;
+
+public class HoodieIndexClientUtils {
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata,
+                                Functions.Function2<HoodieTableMetaClient, 
List<String>, Void> updateColSatsFunc) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {

Review Comment:
   Snce this gets called after writing the completed instant, the table config 
`hoodie.table.metadata.partitions` must have been updated. So, instead of using 
the metadata config, can we use 
`metaClient().getTableConfig().getMetadataPartitions().contains(MetadataPartitionType.COLUMN_STATS.getPartitionPath())`
 ?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieIndexClientUtils.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Functions;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+import org.apache.hudi.metadata.HoodieTableMetadataUtil;
+import org.apache.hudi.metadata.MetadataPartitionType;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.List;
+
+public class HoodieIndexClientUtils {
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata,
+                                Functions.Function2<HoodieTableMetaClient, 
List<String>, Void> updateColSatsFunc) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
+      try {
+        HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+            .setStorage(dataTable.getStorage())
+            
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
+            .build();
+        HoodieInstant latestInstant = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,
 b) -> b).get();

Review Comment:
   As discussed, if it is possible to avoid `reduce` by calling 
`getCommitsTimeline` or `getWriteTimeline`, then let's do that.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieIndexClientUtils.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Functions;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+import org.apache.hudi.metadata.HoodieTableMetadataUtil;
+import org.apache.hudi.metadata.MetadataPartitionType;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.List;
+
+public class HoodieIndexClientUtils {
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata,
+                                Functions.Function2<HoodieTableMetaClient, 
List<String>, Void> updateColSatsFunc) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
+      try {
+        HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+            .setStorage(dataTable.getStorage())
+            
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
+            .build();
+        HoodieInstant latestInstant = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,
 b) -> b).get();
+
+        final HoodieCommitMetadata mdtCommitMetadata = 
mdtMetaClient.getTimelineLayout().getCommitMetadataSerDe().deserialize(
+            latestInstant,
+            
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(),
+            HoodieCommitMetadata.class);
+        if 
(mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath()))
 {
+          // update data table's table config for list of columns indexed.
+          List<String> columnsToIndex = 
HoodieTableMetadataUtil.getColumnsToIndex(commitMetadata, 
dataTable.getMetaClient(), config.getMetadataConfig());
+          // if col stats is getting updated, lets also update list of columns 
indexed if changed.
+          updateColSatsFunc.apply(dataTable.getMetaClient(), columnsToIndex);

Review Comment:
   how are nested fields getting tracked? Is it like `a.b`? Asking bcoz if a 
nested field has same name as any other non-nested field, then equals check 
could return unexpected result.



##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java:
##########
@@ -232,6 +232,41 @@ public void buildIndexDefinition(HoodieIndexDefinition 
indexDefinition) {
     }
   }
 
+  /**
+   * Builds expression index definition and writes to index definition file.
+   */
+  public boolean buildColSatsIndexDefinition(HoodieIndexDefinition 
indexDefinition) {
+    String indexName = indexDefinition.getIndexName();
+    String indexMetaPath = getIndexDefinitionPath();
+    boolean updateIndexDefn = false;
+    if (indexMetadataOpt.isPresent()) {
+      if (indexMetadataOpt.get().getIndexDefinitions().containsKey(indexName)) 
{

Review Comment:
   I think we can merge this method with the method above - 
`buildIndexDefinition`. At this point, I don't see a need to separate the two. 
Any special handling for columns stats can be done within the method. If at all 
a new method is necessary, I would still recommend to keep the file i/o related 
stuff in a common method, and just separate out the in-memory manipulation of 
index definition in separate methods. Main concern is reuse. Tomorrow, if we 
make any change in the protocol to update/delete the index definition file then 
we don't have to change at two places.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -477,7 +482,16 @@ private void initializeFromFilesystem(String 
initializationTime, List<MetadataPa
       HoodieData<HoodieRecord> records = 
fileGroupCountAndRecordsPair.getValue();
       bulkCommit(instantTimeForPartition, partitionName, records, 
fileGroupCount);
       metadataMetaClient.reloadActiveTimeline();
-
+      if (partitionType == COLUMN_STATS) {
+        // initialize Col Stats
+        // if col stats, lets also update list of columns indexed if changed.
+        updateColumnsToIndexWithColStats(columnsToIndex);

Review Comment:
   If we're anyway going to check and update post commit, why doing it here?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -519,15 +536,28 @@ private Pair<Integer, HoodieData<HoodieRecord>> 
initializePartitionStatsIndex(Li
     return Pair.of(fileGroupCount, records);
   }
 
-  private Pair<Integer, HoodieData<HoodieRecord>> 
initializeColumnStatsPartition(Map<String, Map<String, Long>> 
partitionToFilesMap) {
+  private Pair<List<String>, Pair<Integer, HoodieData<HoodieRecord>>> 
initializeColumnStatsPartition(Map<String, Map<String, Long>> 
partitionToFilesMap) {
+    // Find the columns to index
+    final List<String> columnsToIndex = 
HoodieTableMetadataUtil.getColumnsToIndex(dataMetaClient.getTableConfig(),
+        dataWriteConfig.getMetadataConfig(), Lazy.lazily(() -> 
HoodieTableMetadataUtil.tryResolveSchemaForTable(dataMetaClient)));
+    if (columnsToIndex.isEmpty()) {
+      // atleast meta fields will be chosen for indexing. So, we should not 
reach this state.

Review Comment:
   what happens when `populateMetaFields` is disabled? Should we still error 
out?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/index/functional/BaseHoodieIndexClient.java:
##########
@@ -61,6 +71,8 @@ public void register(HoodieTableMetaClient metaClient, 
HoodieIndexDefinition ind
    */
   public abstract void create(HoodieTableMetaClient metaClient, String 
indexName, String indexType, Map<String, Map<String, String>> columns, 
Map<String, String> options) throws Exception;
 
+  public abstract void createOrUpdateColStatsIndex(HoodieTableMetaClient 
metaClient, List<String> columnsToIndex);

Review Comment:
   how about rename to `createOrUpdateColumnStatsIndexDefinition`? The method 
is not actually creating/updating the stats right.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/HoodieIndexClientUtils.java:
##########
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.hudi.client;
+
+import org.apache.hudi.common.model.HoodieCommitMetadata;
+import org.apache.hudi.common.table.HoodieTableMetaClient;
+import org.apache.hudi.common.table.timeline.HoodieInstant;
+import org.apache.hudi.common.util.Functions;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.exception.HoodieException;
+import org.apache.hudi.metadata.HoodieTableMetadata;
+import org.apache.hudi.metadata.HoodieTableMetadataUtil;
+import org.apache.hudi.metadata.MetadataPartitionType;
+import org.apache.hudi.table.HoodieTable;
+
+import java.util.List;
+
+public class HoodieIndexClientUtils {
+
+  public static void updateColsToIndex(HoodieTable dataTable, 
HoodieWriteConfig config, HoodieCommitMetadata commitMetadata,
+                                Functions.Function2<HoodieTableMetaClient, 
List<String>, Void> updateColSatsFunc) {
+    if (config.getMetadataConfig().isColumnStatsIndexEnabled()) {
+      try {
+        HoodieTableMetaClient mdtMetaClient = HoodieTableMetaClient.builder()
+            .setStorage(dataTable.getStorage())
+            
.setBasePath(HoodieTableMetadata.getMetadataTableBasePath(dataTable.getMetaClient().getBasePath()))
+            .build();
+        HoodieInstant latestInstant = 
mdtMetaClient.getActiveTimeline().filterCompletedInstants().getInstantsOrderedByCompletionTime().reduce((a,
 b) -> b).get();
+
+        final HoodieCommitMetadata mdtCommitMetadata = 
mdtMetaClient.getTimelineLayout().getCommitMetadataSerDe().deserialize(
+            latestInstant,
+            
mdtMetaClient.getActiveTimeline().getInstantDetails(latestInstant).get(),
+            HoodieCommitMetadata.class);
+        if 
(mdtCommitMetadata.getPartitionToWriteStats().containsKey(MetadataPartitionType.COLUMN_STATS.getPartitionPath()))
 {

Review Comment:
   This check is costly. If not for this check, then we don't need 
`mdtCommitMetadata` and hence don't need to create mdtMetaClient and 
deserialize the instant. If we strictly need this check, please consider 
caching the meta client or reusing an existing instance if possible.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to