Re: [PR] fix: Handle nested map and array columns in MDT [hudi]

via GitHub Thu, 15 Jan 2026 07:53:40 -0800


the-other-tim-brown commented on code in PR #17694:
URL: https://github.com/apache/hudi/pull/17694#discussion_r2694925821



##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchemaUtils.java:
##########
@@ -59,6 +59,41 @@ public final class HoodieSchemaUtils {
   public static final HoodieSchema METADATA_FIELD_SCHEMA = HoodieSchema.createNullable(HoodieSchemaType.STRING);
   public static final HoodieSchema RECORD_KEY_SCHEMA = initRecordKeySchema();
 
+  /**
+   * Constants for Parquet-style accessor patterns used in nested MAP and ARRAY navigation.
+   * These patterns are specifically used for column stats generation and differ from
+   * InternalSchema constants which are used in schema evolution contexts.
+   */
+  private static final String ARRAY_LIST = "list";
+  private static final String ARRAY_ELEMENT = "element";
+  private static final String ARRAY_SPARK = "array"; // Spark writer uses this

Review Comment:
   Let's standardize on HoodieSchema instead of Spark vs. Avro.



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala:
##########
@@ -46,24 +46,226 @@ import scala.collection.JavaConverters._
 object HoodieSchemaUtils {
   private val log = LoggerFactory.getLogger(getClass)
 
-  def getSchemaForField(schema: StructType, fieldName: String): org.apache.hudi.common.util.collection.Pair[String, StructField] = {
-    getSchemaForField(schema, fieldName, StringUtils.EMPTY_STRING)
+  /**
+   * Constants for Parquet-style accessor patterns used in nested MAP and ARRAY navigation.
+   * These patterns are specifically used for column stats generation and differ from
+   * InternalSchema constants which are used in schema evolution contexts.
+   */
+  private final val ARRAY_LIST = "list"

Review Comment:
   I don't like having two different implementations of essentially the same
   search logic. Let's have all schema-related operations go through HoodieSchema.



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndex.scala:
##########
@@ -259,6 +259,221 @@ class TestColumnStatsIndex extends ColumnStatIndexTestBase {
       addNestedFiled = true)
   }
 
+  /**
+   * Tests data skipping with nested MAP and ARRAY fields in column stats index.
+   * This test verifies that queries can efficiently skip files based on nested field values
+   * within MAP and ARRAY types using the new Parquet-style accessor patterns.
+   */
+  @ParameterizedTest

Review Comment:
   We'll want to test both 2-level and 3-level lists here to ensure compatibility.



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -219,7 +219,11 @@ public final class HoodieMetadataConfig extends HoodieConfig {
       .noDefaultValue()
       .markAdvanced()
       .sinceVersion("0.11.0")
-      .withDocumentation("Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed");
+      .withDocumentation("Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed. "
+          + "For nested fields within ARRAY types, use: field.list.element (Avro convention) or field.array (Spark convention) "

Review Comment:
   Should we limit the input to a single format to avoid confusion?
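
   A minimal sketch of what accepting a single format could look like: the
   config value is normalized once, so downstream lookups only ever see the
   field.list.element form. The class and method names below are hypothetical
   and are not Hudi APIs. The sketch also assumes that no real nested field is
   itself named "array", which would make the rewrite ambiguous.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Hudi.
class ColumnPathNormalizer {

  // Rewrites every non-leading "array" segment (the Spark-style accessor
  // described in the config doc) to "list.element" (the Avro-style accessor).
  // Assumption: a nested field is never literally named "array".
  static String normalize(String columnPath) {
    String[] parts = columnPath.split("\\.");
    List<String> out = new ArrayList<>();
    for (int i = 0; i < parts.length; i++) {
      // The first segment is always a top-level field name, never an accessor.
      if (i > 0 && parts[i].equals("array")) {
        out.add("list");
        out.add("element");
      } else {
        out.add(parts[i]);
      }
    }
    return String.join(".", out);
  }

  public static void main(String[] args) {
    System.out.println(normalize("items.array.price"));
    System.out.println(normalize("items.list.element.price"));
  }
}
```

   With a step like this, both spellings resolve to the same path, but the
   documentation would only need to advertise one of them.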


