wombatu-kun commented on code in PR #18807:
URL: https://github.com/apache/hudi/pull/18807#discussion_r3408896160


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestColStatsAutoExtendOnAddColumn.scala:
##########
@@ -0,0 +1,289 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.functional
+
+import org.apache.hudi.{ColumnStatsIndexSupport, DataSourceWriteOptions, HoodieSchemaConversionUtils, ScalaAssertionSupport, SparkAdapterSupport}
+import org.apache.hudi.HoodieConversionUtils.{toJavaOption, toProperties}
+import org.apache.hudi.common.config.{HoodieMetadataConfig, RecordMergeMode}
+import org.apache.hudi.common.model.HoodieTableType
+import org.apache.hudi.common.table.HoodieTableConfig
+import org.apache.hudi.common.util.Option
+import org.apache.hudi.config.HoodieWriteConfig
+import org.apache.hudi.testutils.HoodieSparkClientTestBase
+import org.apache.hudi.util.JFunction
+
+import org.apache.spark.sql.{Row, SaveMode, SparkSession, SparkSessionExtensions}
+import org.apache.spark.sql.hudi.HoodieSparkSessionExtension
+import org.apache.spark.sql.types.{IntegerType, LongType, StringType, StructField, StructType}
+import org.junit.jupiter.api.{BeforeEach, Tag}
+import org.junit.jupiter.api.Assertions.{assertEquals, assertTrue}
+import org.junit.jupiter.params.ParameterizedTest
+import org.junit.jupiter.params.provider.CsvSource
+
+import java.util.function.Consumer
+
+import scala.collection.JavaConverters._
+
+/**
+ * Tests for: when a new column is added to a Hudi table via schema evolution,
+ * how is it treated by the MDT column_stats index?
+ *
+ * This is item #2 from a CDC schema-evolution improvements roadmap. The original
+ * recommendation assumed Hudi never auto-extends col_stats to new columns. The
+ * tests below confirm the actual behavior is more nuanced — it depends on
+ * whether the user set an explicit {@code hoodie.metadata.index.column.stats.column.list}.
+ *
+ * Two scenarios:
+ *
+ *   (A) No explicit column.list — default "index all eligible columns" mode.
+ *       Test: {@link #testNewColumnInDefaultModeIsAutoIndexed}.
+ *       Expected and observed: the new column IS auto-indexed in subsequent
+ *       commits. Pre-evolution files have null stats for the new column (the
+ *       column didn't exist when those files were written); post-evolution
+ *       files have stats. Working as intended.
+ *
+ *   (B) Explicit column.list that doesn't include the new column.
+ *       Test: {@link #testNewColumnWithExplicitListIsNotAutoIndexed}.
+ *       Expected: the new column should NOT be auto-indexed (the user opted into
+ *       an explicit list). Confirmed.
+ *
+ * The user-facing implication, and the question this PR is opening for review:
+ *
+ *   - For users on the default mode, no action is needed when a column is added
+ *     via schema evolution. col_stats auto-extends.
+ *   - For users with an explicit column.list, a column added at the source is
+ *     silently NOT indexed in Hudi unless they remember to extend the list.
+ *     Data-skipping queries on the new column will silently fall back to
+ *     full-file scans.
+ *
+ * Question for reviewers: should there be a config option that lets users have

Review Comment:
   +1



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

