nsivabalan commented on code in PR #12310:
URL: https://github.com/apache/hudi/pull/12310#discussion_r1854572540


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -173,6 +173,14 @@ public final class HoodieMetadataConfig extends 
HoodieConfig {
       .sinceVersion("0.11.0")
       .withDocumentation("Comma-separated list of columns for which column 
stats index will be built. If not set, all columns will be indexed");
 
+  public static final ConfigProperty<Integer> COLUMN_STATS_INDEX_MAX_COLUMNS = 
ConfigProperty

Review Comment:
   can we do 
   ```
   METADATA_PREFIX + ".index.column.stats.max.columns.to.index"
   ```



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -413,6 +421,10 @@ public List<String> getColumnsEnabledForColumnStatsIndex() 
{
     return StringUtils.split(getString(COLUMN_STATS_INDEX_FOR_COLUMNS), 
CONFIG_VALUES_DELIMITER);
   }
 
+  public Integer getMaximumColumnsForColumnStats() {

Review Comment:
   once you fix the config key, do change the getter method names accordingly. 



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1193,20 +1200,20 @@ public static HoodieData<HoodieRecord> 
convertMetadataToColumnStatsRecords(Hoodi
    */
   private static List<String> getColumnsToIndex(boolean 
isColumnStatsIndexEnabled,
                                                 List<String> 
targetColumnsForColumnStatsIndex,
+                                                int maxColumnsToIndex,
                                                 Lazy<Option<Schema>> 
lazyWriterSchemaOpt) {
     checkState(isColumnStatsIndexEnabled);
 
     if (!targetColumnsForColumnStatsIndex.isEmpty()) {
       return targetColumnsForColumnStatsIndex;
+    } else {
+      if (maxColumnsToIndex <= 0) {
+        return Collections.emptyList();
+      }
+      return lazyWriterSchemaOpt.get()
+          .map(tableSchema -> getFirstNFieldNames(tableSchema, 
maxColumnsToIndex))

Review Comment:
   and should we add 5 to the value that user sets to account for meta fields? 



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2525,4 +2536,31 @@ Map<String, Long> getFileNameToSizeMap() {
       return filenameToSizeMap;
     }
   }
+
+  public static List<String> getColumnsToIndex(HoodieMetadataConfig 
metadataConfig, Schema tableSchema) {

Review Comment:
   why do we have multiple methods to deduce the columns to index?



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2525,4 +2536,31 @@ Map<String, Long> getFileNameToSizeMap() {
       return filenameToSizeMap;
     }
   }
+
+  public static List<String> getColumnsToIndex(HoodieMetadataConfig 
metadataConfig, Schema tableSchema) {
+    List<String> userSetStrings = 
metadataConfig.getColumnsEnabledForColumnStatsIndex();

Review Comment:
   userSetStrings -> userConfiguredColsToIndex 



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -367,6 +367,7 @@ public static Map<String, HoodieData<HoodieRecord>> 
convertMetadataToRecords(Hoo
                                                                                
boolean isColumnStatsIndexEnabled,
                                                                                
int columnStatsIndexParallelism,
                                                                                
List<String> targetColumnsForColumnStatsIndex,
+                                                                               
int maxColsToIndex,

Review Comment:
   why can't we deduce the list of columns at higher layer itself. 
   in other words, why do we need both List<String> 
targetColumnsForColumnStatsIndex, and int maxColsToIndex. 
   



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala:
##########
@@ -67,6 +67,29 @@ class TestColumnStatsIndexWithSQL extends 
ColumnStatIndexTestBase {
     verifyFileIndexAndSQLQueries(commonOpts)
   }
 
+  @ParameterizedTest
+  @MethodSource(Array("testMetadataColumnStatsIndexParams"))
+  def testMetadataColumnStatsIndexWithSQLWithLimitedIndexes(testCase: 
ColumnStatsTestCase): Unit = {
+    val metadataOpts = Map(
+      HoodieMetadataConfig.ENABLE.key -> "true",
+      HoodieMetadataConfig.ENABLE_METADATA_INDEX_COLUMN_STATS.key -> "true",
+      HoodieMetadataConfig.COLUMN_STATS_INDEX_MAX_COLUMNS.key() -> "8"

Review Comment:
   once we add +5 as above feedback, I guess you have to reduce this value to 3.



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1193,20 +1200,20 @@ public static HoodieData<HoodieRecord> 
convertMetadataToColumnStatsRecords(Hoodi
    */
   private static List<String> getColumnsToIndex(boolean 
isColumnStatsIndexEnabled,
                                                 List<String> 
targetColumnsForColumnStatsIndex,
+                                                int maxColumnsToIndex,
                                                 Lazy<Option<Schema>> 
lazyWriterSchemaOpt) {
     checkState(isColumnStatsIndexEnabled);
 
     if (!targetColumnsForColumnStatsIndex.isEmpty()) {
       return targetColumnsForColumnStatsIndex;
+    } else {
+      if (maxColumnsToIndex <= 0) {
+        return Collections.emptyList();

Review Comment:
   how can maxColumnsToIndex < 0?
   if some user mis-configures to have a value of 0 or less? 
   



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2525,4 +2536,31 @@ Map<String, Long> getFileNameToSizeMap() {
       return filenameToSizeMap;
     }
   }
+
+  public static List<String> getColumnsToIndex(HoodieMetadataConfig 
metadataConfig, Schema tableSchema) {
+    List<String> userSetStrings = 
metadataConfig.getColumnsEnabledForColumnStatsIndex();
+    if (!userSetStrings.isEmpty()) {
+      return userSetStrings;
+    } else {
+      return getFirstNFieldNames(tableSchema, 
metadataConfig.getMaximumColumnsForColumnStats());

Review Comment:
   again, +5? same comment as above



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndexWithSQL.scala:
##########
@@ -204,33 +227,42 @@ class TestColumnStatsIndexWithSQL extends 
ColumnStatIndexTestBase {
     verifyFileIndexAndSQLQueries(commonOpts)
   }
 
-  private def setupTable(testCase: ColumnStatsTestCase, metadataOpts: 
Map[String, String], commonOpts: Map[String, String], shouldValidate: Boolean): 
Unit = {
+  private def setupTable(testCase: ColumnStatsTestCase, metadataOpts: 
Map[String, String], commonOpts: Map[String, String],

Review Comment:
   can you add another simple test (even parametrize an existing test). where 
you set numColsToIndex to say 100. expected behavior should be same as not 
setting any explicit value for the config of interest. 
   



##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -1193,20 +1200,20 @@ public static HoodieData<HoodieRecord> 
convertMetadataToColumnStatsRecords(Hoodi
    */
   private static List<String> getColumnsToIndex(boolean 
isColumnStatsIndexEnabled,
                                                 List<String> 
targetColumnsForColumnStatsIndex,
+                                                int maxColumnsToIndex,
                                                 Lazy<Option<Schema>> 
lazyWriterSchemaOpt) {
     checkState(isColumnStatsIndexEnabled);
 
     if (!targetColumnsForColumnStatsIndex.isEmpty()) {
       return targetColumnsForColumnStatsIndex;
+    } else {

Review Comment:
   So, is getColumnsToIndex) at L 1201 is the only method all callers use to 
deduce the list of fields to index for col stats? 
   



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -413,6 +421,10 @@ public List<String> getColumnsEnabledForColumnStatsIndex() 
{
     return StringUtils.split(getString(COLUMN_STATS_INDEX_FOR_COLUMNS), 
CONFIG_VALUES_DELIMITER);
   }
 
+  public Integer getMaximumColumnsForColumnStats() {

Review Comment:
   we could rename this to 
   maxColumnsToIndexForColStats 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to