the-other-tim-brown commented on code in PR #17694:
URL: https://github.com/apache/hudi/pull/17694#discussion_r2694925821
##########
hudi-common/src/main/java/org/apache/hudi/common/schema/HoodieSchemaUtils.java:
##########
@@ -59,6 +59,41 @@ public final class HoodieSchemaUtils {
public static final HoodieSchema METADATA_FIELD_SCHEMA =
HoodieSchema.createNullable(HoodieSchemaType.STRING);
public static final HoodieSchema RECORD_KEY_SCHEMA = initRecordKeySchema();
+ /**
+ * Constants for Parquet-style accessor patterns used in nested MAP and ARRAY navigation.
+ * These patterns are specifically used for column stats generation and differ from
+ * InternalSchema constants which are used in schema evolution contexts.
+ */
+ private static final String ARRAY_LIST = "list";
+ private static final String ARRAY_ELEMENT = "element";
+ private static final String ARRAY_SPARK = "array"; // Spark writer uses this
Review Comment:
Let's standardize on HoodieSchema instead of distinguishing Spark vs. Avro.
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala:
##########
@@ -46,24 +46,226 @@ import scala.collection.JavaConverters._
object HoodieSchemaUtils {
private val log = LoggerFactory.getLogger(getClass)
- def getSchemaForField(schema: StructType, fieldName: String): org.apache.hudi.common.util.collection.Pair[String, StructField] = {
- getSchemaForField(schema, fieldName, StringUtils.EMPTY_STRING)
+ /**
+ * Constants for Parquet-style accessor patterns used in nested MAP and ARRAY navigation.
+ * These patterns are specifically used for column stats generation and differ from
+ * InternalSchema constants which are used in schema evolution contexts.
+ */
+ private final val ARRAY_LIST = "list"
Review Comment:
I don't like the approach of having two different implementations of
essentially the same search logic. Let's have all schema related operations go
through HoodieSchema.
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColumnStatsIndex.scala:
##########
@@ -259,6 +259,221 @@ class TestColumnStatsIndex extends ColumnStatIndexTestBase {
addNestedFiled = true)
}
+ /**
+ * Tests data skipping with nested MAP and ARRAY fields in column stats index.
+ * This test verifies that queries can efficiently skip files based on nested field values
+ * within MAP and ARRAY types using the new Parquet-style accessor patterns.
+ */
+ @ParameterizedTest
Review Comment:
We'll want to test both 2-level and 3-level lists here to ensure compatibility.
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -219,7 +219,11 @@ public final class HoodieMetadataConfig extends HoodieConfig {
.noDefaultValue()
.markAdvanced()
.sinceVersion("0.11.0")
- .withDocumentation("Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed");
+ .withDocumentation("Comma-separated list of columns for which column stats index will be built. If not set, all columns will be indexed. "
+ + "For nested fields within ARRAY types, use: field.list.element (Avro convention) or field.array (Spark convention) "
Review Comment:
Should we limit the input to a single format to avoid confusion?
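One possible shape for this, as a rough sketch only: rewrite the configured column paths to a single canonical form before any lookup, so the rest of the code only ever sees `field.list.element`. `ColumnPathNormalizer` and `normalize` are hypothetical names, not existing Hudi APIs; the constants mirror the ones added in this PR. The sketch is purely string-based and assumes no real field is named `array`; an actual implementation would need to consult the schema to tell an accessor segment from a field name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of Hudi): maps the Spark-style accessor to the list.element form. */
final class ColumnPathNormalizer {
  private static final String ARRAY_LIST = "list";
  private static final String ARRAY_ELEMENT = "element";
  private static final String ARRAY_SPARK = "array";

  private ColumnPathNormalizer() {
  }

  /**
   * Rewrites every non-leading "array" segment to "list.element".
   * Assumption: a non-leading "array" segment is always an accessor, never a field name.
   */
  static String normalize(String columnPath) {
    String[] parts = columnPath.split("\\.");
    List<String> out = new ArrayList<>();
    for (int i = 0; i < parts.length; i++) {
      // The first segment is always a top-level field name, so it is left untouched.
      if (i > 0 && ARRAY_SPARK.equals(parts[i])) {
        out.add(ARRAY_LIST);
        out.add(ARRAY_ELEMENT);
      } else {
        out.add(parts[i]);
      }
    }
    return String.join(".", out);
  }
}
```

With something like this, `tags.array` and `tags.list.element` would resolve to the same column, and the documentation could describe a single format.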