yihua commented on code in PR #18455:
URL: https://github.com/apache/hudi/pull/18455#discussion_r3082832377
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##########
@@ -583,6 +576,9 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
// First Operation:
// Producing parquet files to three default partitions.
// SNAPSHOT view on MOR table with parquet files only.
Review Comment:
🤖 These commented-out lines (`val BASE_OUTPUT_PATH`, `val tablePath`) look
like leftover debug code. Could you remove them?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
pom.xml:
##########
@@ -170,7 +170,7 @@
<rocksdbjni.version>7.5.3</rocksdbjni.version>
<spark33.version>3.3.4</spark33.version>
<spark34.version>3.4.3</spark34.version>
- <spark35.version>3.5.5</spark35.version>
+ <spark35.version>3.5.2-INTERNAL-0.16</spark35.version>
Review Comment:
🤖 This changes the Spark version from a public Apache release (`3.5.5`) to
what appears to be a private internal build (`3.5.2-INTERNAL-0.16`). This would
break builds for anyone without access to the internal repository and is not
appropriate for the Apache Hudi mainline. Could you revert this and keep the
Gluten integration working with a standard Spark version?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
RCA_Gluten_Velox_MOR_Failures.md:
##########
@@ -0,0 +1,373 @@
+# RCA: Gluten/Velox MOR Test Failures
Review Comment:
🤖 This is an excellent root-cause analysis document, but should it live in
the repo root? It seems more suited for a wiki page, Jira ticket, or the PR
description itself. If it does belong in the repo, a subdirectory like `docs/`
or `dev/` might be more appropriate.
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala:
##########
@@ -48,7 +50,7 @@ class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase {
val storage = HoodieTestUtils.getDefaultStorage
val mergeConfigClause = if (setRecordMergeConfigs) {
// with precombine field set, UPSERT operation is used automatically
- if (tableVersion.toInt == 6) {
+ if (tableVersion.toInt == 6) {
Review Comment:
🤖 This looks like an accidental whitespace-only change to the `if` line (re-indented with no functional difference). It adds noise here and could cause confusion in future diffs; could you revert it?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala:
##########
@@ -103,6 +103,7 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi
     // Additionally, we have to explicitly wrap around resulting [[RDD]] into the one
     // injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s).
     // [[SQLConf]] is required by [[AvroSerializer]]
+    logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}")
Review Comment:
🤖 This `logWarning` dumps the full executed plan on every `createRdd` call in production code. It looks like a debug statement and should be removed before merging: it will produce excessive log output in production workloads and may leak schema details into logs.
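If plan visibility is still useful while debugging the Gluten integration, one option is to gate it behind the debug level instead. A sketch, assuming the enclosing object mixes in Spark's `Logging` trait (as `HoodieSparkUtils` does):

```scala
// Only pay the cost of rendering the plan tree when debug logging is enabled.
if (log.isDebugEnabled) {
  logDebug(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}")
}
```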
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
pom.xml:
##########
@@ -1963,6 +1963,11 @@
<id>confluent</id>
<url>https://packages.confluent.io/maven/</url>
</repository>
+ <repository>
+ <id>codeartifact</id>
+ <name>codeartifact</name>
Review Comment:
🤖 Adding a private CodeArtifact repository
(`onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com`) to the root
POM would force all Hudi builders to have access to this internal repo. This
should not be in the mainline POM — it belongs in a local `settings.xml` or a
separate profile.
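One possible shape, sketched with placeholder ids (the repository URL is deliberately elided here): scope the repository to an opt-in profile so default builds never resolve against it:

```xml
<!-- Opt-in profile; activate with -Pinternal-repo only where access exists. -->
<profile>
  <id>internal-repo</id>
  <repositories>
    <repository>
      <id>codeartifact</id>
      <name>codeartifact</name>
      <url><!-- internal repository URL --></url>
    </repository>
  </repositories>
</profile>
```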
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala:
##########
@@ -100,8 +102,11 @@ class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase {
       // TODO(HUDI-8820): enable MDT after supporting MDT with table version 6
("hoodie.metadata.enable" -> "false", tableVersion.toInt == 6)
) {
- withRecordType()(withTempDir { tmp =>
+ withRecordType()({
val tableName = generateTableName
+ val currTime = System.currentTimeMillis()
+      val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"
Review Comment:
🤖 Same issue as above: the hardcoded path `/home/ubuntu/ws3/hudi-rs/tables/table...` only works on a specific machine. The original code used `withTempDir { tmp => ... }`, which was correct and also cleaned up the temp directory; this change removes that cleanup. The mutable `cnt` counter in the companion object is also not thread-safe if tests run in parallel.
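A sketch of the shape this could take; `AtomicInteger` is shown in case the counter is still needed, though `withTempDir` alone may make it unnecessary:

```scala
object TestMergeModeCommitTimeOrdering {
  // Thread-safe under parallel test execution.
  val cnt = new java.util.concurrent.atomic.AtomicInteger(0)
}

// ... inside the test body:
withRecordType()(withTempDir { tmp =>
  val tableName = generateTableName
  // Unique, machine-independent path; cleaned up when the block exits.
  val tablePath = s"${tmp.getCanonicalPath}/$tableName"
  // ...
})
```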
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##########
@@ -77,7 +77,7 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
)
val sparkOpts = Map(
    HoodieWriteConfig.RECORD_MERGE_IMPL_CLASSES.key -> classOf[DefaultSparkRecordMerger].getName,
- HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet"
+ HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "avro"
Review Comment:
🤖 Changing the default log block format from `parquet` to `avro` globally in
this test class changes what's being tested for all runs, not just Gluten runs.
Parquet log blocks are a supported and commonly used configuration that should
continue to be tested. Could you keep `parquet` as default and only switch to
`avro` when Gluten is active?
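A possible shape, assuming a Gluten-detection helper is available (this PR imports a `GlutenTestUtils` elsewhere; the `isGlutenEnabled` method name here is a placeholder):

```scala
// Keep parquet log blocks as the tested default; fall back to avro only
// when running under Gluten/Velox, which doesn't support them yet.
val logBlockFormat = if (GlutenTestUtils.isGlutenEnabled) "avro" else "parquet"
val sparkOpts = Map(
  HoodieWriteConfig.RECORD_MERGE_IMPL_CLASSES.key -> classOf[DefaultSparkRecordMerger].getName,
  HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> logBlockFormat
)
```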
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala:
##########
@@ -28,13 +28,15 @@ import org.apache.hudi.common.testutils.HoodieTestUtils
import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase.validateTableConfig
+object TestMergeModeCommitTimeOrdering {
+ var cnt: Int = 0
+}
+
class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase {
Seq(
- "cow,current,false,false", "cow,current,false,true",
"cow,current,true,false",
- "mor,current,false,false", "mor,current,false,true",
"mor,current,true,false",
- "cow,6,true,false", "cow,6,true,true", "mor,6,true,true",
- "cow,8,true,false", "cow,8,true,true", "mor,8,true,true").foreach { args =>
+ // "cow,current,false,false", "cow,current,false,true",
"cow,current,true,false",
+ "mor,current,false,false", "mor,current,false,true",
"mor,current,true,false").foreach { args =>
val argList = args.split(',')
val tableType = argList(0)
Review Comment:
🤖 COW table type tests have been commented out (`cow,current,...`), which
removes test coverage for Copy-on-Write tables with commit time ordering. This
is a core Hudi feature. Was this intentional, or leftover from debugging?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderOnSpark.scala:
##########
@@ -30,32 +31,35 @@ import org.apache.hudi.common.schema.HoodieSchema
import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
import org.apache.hudi.common.table.read.TestHoodieFileGroupReaderBase.{hoodieRecordsToIndexedRecords, supportedFileFormats}
import org.apache.hudi.common.table.read.TestHoodieFileGroupReaderOnSpark.getFileCount
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
import org.apache.hudi.common.testutils.{HoodieTestDataGenerator, HoodieTestUtils}
import org.apache.hudi.common.util.{Option => HOption, OrderingValues}
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
import org.apache.hudi.storage.{StorageConfiguration, StoragePath}
-import org.apache.hudi.testutils.SparkClientFunctionalTestHarness
+import org.apache.hudi.testutils.{GlutenTestUtils, SparkClientFunctionalTestHarness}
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{HoodieSparkKryoRegistrar, SparkConf}
+import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{Dataset, HoodieInternalRowUtils, Row, SaveMode, SparkSession}
import org.apache.spark.sql.avro.HoodieSparkSchemaConverters
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
import org.apache.spark.sql.execution.datasources.SparkColumnarFileReader
import org.apache.spark.sql.hudi.MultipleColumnarFileFormatReader
Review Comment:
🤖 The `@Disabled` annotations on tests like `testCustomDelete` use reasons
like "Custom delete payload not supported" — but these features ARE supported
by Hudi. They're only unsupported by Gluten/Velox. If this PR is meant to run
alongside normal Hudi CI, these disabled tests would silently skip important
coverage. Consider using `@EnabledIf` with a condition that checks for Gluten
presence.
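For example, using JUnit 5's `@DisabledIf` so the recorded reason reflects the actual limitation (the condition method is a placeholder that would need to exist on `GlutenTestUtils`):

```scala
@DisabledIf(
  value = "org.apache.hudi.testutils.GlutenTestUtils#isGlutenEnabled",
  disabledReason = "Custom delete payload not yet supported by Gluten/Velox")
@Test
def testCustomDelete(): Unit = {
  // ... unchanged test body ...
}
```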
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-common/src/test/java/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderBase.java:
##########
@@ -113,7 +113,7 @@
* Tests {@link HoodieFileGroupReader} with different engines
*/
public abstract class TestHoodieFileGroupReaderBase<T> {
-  private static final List<HoodieFileFormat> DEFAULT_SUPPORTED_FILE_FORMATS = Arrays.asList(HoodieFileFormat.PARQUET, HoodieFileFormat.ORC);
+  private static final List<HoodieFileFormat> DEFAULT_SUPPORTED_FILE_FORMATS = Arrays.asList(HoodieFileFormat.PARQUET);
Review Comment:
🤖 Removing ORC from `DEFAULT_SUPPORTED_FILE_FORMATS` and commenting out
CUSTOM merge mode tests significantly reduces test coverage for all users, not
just Gluten runs. These changes affect the base test class used by multiple
engines. Could you gate these reductions behind the `gluten.bundle.jar`
property instead of removing them unconditionally?
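A self-contained sketch of the gating idea (a stand-in enum is used since this snippet can't reference `HoodieFileFormat` directly; `gluten.bundle.jar` is the system property this PR's build already uses):

```java
import java.util.Arrays;
import java.util.List;

public class FormatGate {
    // Stand-in for org.apache.hudi.common.model.HoodieFileFormat.
    enum FileFormat { PARQUET, ORC }

    // ORC stays in the default list; it is dropped only when the Gluten
    // bundle property is set on the test JVM.
    static List<FileFormat> supportedFormats() {
        return System.getProperty("gluten.bundle.jar") != null
            ? Arrays.asList(FileFormat.PARQUET)
            : Arrays.asList(FileFormat.PARQUET, FileFormat.ORC);
    }

    public static void main(String[] args) {
        System.out.println(supportedFormats());
    }
}
```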
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala:
##########
@@ -0,0 +1,823 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.hudi.dml.others
+
+import org.apache.hudi.HoodieCLIUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieLogFile}
+import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.table.log.HoodieLogFileReader
+import org.apache.hudi.common.table.log.block.HoodieLogBlock.{HeaderMetadataType, HoodieLogBlockType}
+import org.apache.hudi.common.testutils.HoodieTestUtils
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.testutils.DataSourceTestUtils
+
+import org.apache.hadoop.fs.Path
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase.getMetaClientAndFileSystemView
+
+import java.util.Collections
+
+import scala.collection.JavaConverters._
+
+/**
+ * Creates 4 MOR tables with specific file slice layouts, validates each layout
+ * via the same HoodieTableFileSystemView + getLatestMergedFileSlicesBeforeOrOn
+ * code path that SELECT uses, then saves gold data alongside each table.
+ */
+class TestMORFileSliceLayouts extends HoodieSparkSqlTestBase {
+
+ val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss"
+
+ private def cleanPath(path: String): Unit = {
+ val hadoopPath = new Path(path)
+ val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
Review Comment:
🤖 This hardcodes an absolute path (`/home/ubuntu/ws3/hudi-rs/tables/fdss`)
that only exists on a specific developer machine. Tests will fail or write to
unexpected locations on any other system. Could you use `withTempDir` or
`getBasePath` instead?
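For instance, with the `withTempDir` helper already available on `HoodieSparkSqlTestBase` (a sketch):

```scala
withTempDir { tmp =>
  // Machine-independent base path; removed automatically after the block.
  val baseOutputPath = s"${tmp.getCanonicalPath}/fdss"
  // ... build the four MOR tables and gold data under baseOutputPath ...
}
```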
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##########
@@ -108,27 +108,18 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
@CsvSource(Array(
// Inferred as COMMIT_TIME_ORDERING
"AVRO, AVRO, avro, false,,,CURRENT", "AVRO, SPARK, parquet,
false,,,CURRENT",
- "SPARK, AVRO, parquet, false,,,CURRENT", "SPARK, SPARK, parquet,
false,,,CURRENT",
- "AVRO, AVRO, avro, false,,,EIGHT", "AVRO, SPARK, parquet, false,,,EIGHT",
- "SPARK, AVRO, parquet, false,,,EIGHT", "SPARK, SPARK, parquet,
false,,,EIGHT",
- // EVENT_TIME_ORDERING without precombine field
- "AVRO, AVRO, avro, false,,EVENT_TIME_ORDERING,CURRENT", "AVRO, SPARK,
parquet, false,,EVENT_TIME_ORDERING,CURRENT",
- "SPARK, AVRO, parquet, false,,EVENT_TIME_ORDERING,CURRENT", "SPARK, SPARK,
parquet, false,,EVENT_TIME_ORDERING,CURRENT",
- "AVRO, AVRO, avro, false,,EVENT_TIME_ORDERING,EIGHT", "AVRO, SPARK,
parquet, false,,EVENT_TIME_ORDERING,EIGHT",
- "SPARK, AVRO, parquet, false,,EVENT_TIME_ORDERING,EIGHT", "SPARK, SPARK,
parquet, false,,EVENT_TIME_ORDERING,EIGHT",
- // EVENT_TIME_ORDERING with empty precombine field
- "AVRO, AVRO, avro, true,,EVENT_TIME_ORDERING,CURRENT", "AVRO, SPARK,
parquet, true,,EVENT_TIME_ORDERING,CURRENT",
- "SPARK, AVRO, parquet, true,,EVENT_TIME_ORDERING,CURRENT", "SPARK, SPARK,
parquet, true,,EVENT_TIME_ORDERING,CURRENT",
- "AVRO, AVRO, avro, true,,EVENT_TIME_ORDERING,EIGHT", "AVRO, SPARK,
parquet, true,,EVENT_TIME_ORDERING,EIGHT",
- "SPARK, AVRO, parquet, true,,EVENT_TIME_ORDERING,EIGHT", "SPARK, SPARK,
parquet, true,,EVENT_TIME_ORDERING,EIGHT",
- // Inferred as EVENT_TIME_ORDERING
- "AVRO, AVRO, avro, true, timestamp,,CURRENT", "AVRO, SPARK, parquet, true,
timestamp,,CURRENT",
- "SPARK, AVRO, parquet, true, timestamp,,CURRENT", "SPARK, SPARK, parquet,
true, timestamp,,CURRENT",
- "AVRO, AVRO, avro, true, timestamp,,EIGHT", "AVRO, SPARK, parquet, true,
timestamp,,EIGHT",
- "SPARK, AVRO, parquet, true, timestamp,,EIGHT", "SPARK, SPARK, parquet,
true, timestamp,,EIGHT",
+ // "SPARK, AVRO, parquet, false,,,CURRENT", "SPARK, SPARK, parquet,
false,,,CURRENT",
Review Comment:
🤖 A large number of parameterized test cases have been commented out rather
than removed or conditionally skipped. This removes coverage for SPARK record
type, EVENT_TIME_ORDERING, table versions 6 and 8, and more. These are
important code paths. If they're incompatible with Gluten, could you skip them
conditionally (e.g., via `@EnabledIf`) rather than commenting them out?
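Alternatively, a runtime assumption keeps the full matrix visible and reports the skips (the `GlutenTestUtils.isGlutenEnabled` helper name is a placeholder):

```scala
import org.junit.jupiter.api.Assumptions.assumeTrue

// Skip only the Gluten-incompatible combinations at runtime instead of
// deleting them from the matrix; skipped cases still show up in reports.
assumeTrue(!GlutenTestUtils.isGlutenEnabled || recordType == "AVRO",
  s"$recordType record type not yet supported with Gluten/Velox")
```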
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]