yihua commented on code in PR #18455:
URL: https://github.com/apache/hudi/pull/18455#discussion_r3082832377
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##########
@@ -583,6 +576,9 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
// First Operation:
// Producing parquet files to three default partitions.
// SNAPSHOT view on MOR table with parquet files only.
Review Comment:
🤖 These commented-out lines (`val BASE_OUTPUT_PATH`, `val tablePath`) look
like leftover debug code. Could you remove them?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
pom.xml:
##########
@@ -170,7 +170,7 @@
<rocksdbjni.version>7.5.3</rocksdbjni.version>
<spark33.version>3.3.4</spark33.version>
<spark34.version>3.4.3</spark34.version>
- <spark35.version>3.5.5</spark35.version>
+ <spark35.version>3.5.2-INTERNAL-0.16</spark35.version>
Review Comment:
🤖 This changes the Spark version from a public Apache release (`3.5.5`) to
what appears to be a private internal build (`3.5.2-INTERNAL-0.16`). This would
break builds for anyone without access to the internal repository and is not
appropriate for the Apache Hudi mainline. Could you revert this and keep the
Gluten integration working with a standard Spark version?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
RCA_Gluten_Velox_MOR_Failures.md:
##########
@@ -0,0 +1,373 @@
+# RCA: Gluten/Velox MOR Test Failures
Review Comment:
🤖 This is an excellent root-cause analysis document, but should it live in
the repo root? It seems more suited for a wiki page, Jira ticket, or the PR
description itself. If it does belong in the repo, a subdirectory like `docs/`
or `dev/` might be more appropriate.
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala:
##########
@@ -48,7 +50,7 @@ class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase {
val storage = HoodieTestUtils.getDefaultStorage
val mergeConfigClause = if (setRecordMergeConfigs) {
// with precombine field set, UPSERT operation is used automatically
- if (tableVersion.toInt == 6) {
+ if (tableVersion.toInt == 6) {
Review Comment:
🤖 This looks like an accidental whitespace-only change to the `if` line (re-indented with no functional difference). It adds noise here and could cause confusion in future diffs; could you revert it?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieSparkUtils.scala:
##########
@@ -103,6 +103,7 @@ object HoodieSparkUtils extends SparkAdapterSupport with SparkVersionsSupport wi
     // Additionally, we have to explicitly wrap around resulting [[RDD]] into the one
     // injecting [[SQLConf]], which by default isn't propagated by Spark to the executor(s).
     // [[SQLConf]] is required by [[AvroSerializer]]
+    logWarning(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}")
Review Comment:
🤖 This `logWarning` dumps the full executed plan on every `createRdd` call in production code. It looks like a debug statement and should be removed before merging: it will produce excessive log output in production workloads and may leak schema details into logs.
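If plan visibility is still useful while debugging the Gluten integration, one option is to gate it behind the debug level instead. A sketch, assuming the enclosing object mixes in Spark's `Logging` trait (as `HoodieSparkUtils` does):

```scala
// Only pay the cost of rendering the plan tree when debug logging is enabled.
if (log.isDebugEnabled) {
  logDebug(s"createRdd executedPlan:\n${df.queryExecution.executedPlan.treeString}")
}
```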
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
pom.xml:
##########
@@ -1963,6 +1963,11 @@
<id>confluent</id>
<url>https://packages.confluent.io/maven/</url>
</repository>
+ <repository>
+ <id>codeartifact</id>
+ <name>codeartifact</name>
Review Comment:
🤖 Adding a private CodeArtifact repository
(`onehouse-194159489498.d.codeartifact.us-west-2.amazonaws.com`) to the root
POM would force all Hudi builders to have access to this internal repo. This
should not be in the mainline POM — it belongs in a local `settings.xml` or a
separate profile.
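One possible shape, sketched with placeholder ids (the repository URL is deliberately elided here): scope the repository to an opt-in profile so default builds never resolve against it:

```xml
<!-- Opt-in profile; activate with -Pinternal-repo only where access exists. -->
<profile>
  <id>internal-repo</id>
  <repositories>
    <repository>
      <id>codeartifact</id>
      <name>codeartifact</name>
      <url><!-- internal repository URL --></url>
    </repository>
  </repositories>
</profile>
```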
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala:
##########
@@ -100,8 +102,11 @@ class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase {
       // TODO(HUDI-8820): enable MDT after supporting MDT with table version 6
("hoodie.metadata.enable" -> "false", tableVersion.toInt == 6)
) {
- withRecordType()(withTempDir { tmp =>
+ withRecordType()({
val tableName = generateTableName
+ val currTime = System.currentTimeMillis()
+      val tablePath = s"/home/ubuntu/ws3/hudi-rs/tables/table${TestMergeModeCommitTimeOrdering.cnt}_$currTime"
Review Comment:
🤖 Same issue as above: the hardcoded path `/home/ubuntu/ws3/hudi-rs/tables/table...` only works on a specific machine. The original code used `withTempDir { tmp => ... }`, which was correct and also cleaned up the temp directory; this change removes that cleanup. The mutable `cnt` counter in the companion object is also not thread-safe if tests run in parallel.
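A sketch of the shape this could take; `AtomicInteger` is shown in case the counter is still needed, though `withTempDir` alone may make it unnecessary:

```scala
object TestMergeModeCommitTimeOrdering {
  // Thread-safe under parallel test execution.
  val cnt = new java.util.concurrent.atomic.AtomicInteger(0)
}

// ... inside the test body:
withRecordType()(withTempDir { tmp =>
  val tableName = generateTableName
  // Unique, machine-independent path; cleaned up when the block exits.
  val tablePath = s"${tmp.getCanonicalPath}/$tableName"
  // ...
})
```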
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##########
@@ -77,7 +77,7 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
)
val sparkOpts = Map(
    HoodieWriteConfig.RECORD_MERGE_IMPL_CLASSES.key -> classOf[DefaultSparkRecordMerger].getName,
- HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "parquet"
+ HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> "avro"
Review Comment:
🤖 Changing the default log block format from `parquet` to `avro` globally in
this test class changes what's being tested for all runs, not just Gluten runs.
Parquet log blocks are a supported and commonly used configuration that should
continue to be tested. Could you keep `parquet` as default and only switch to
`avro` when Gluten is active?
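A possible shape, assuming a Gluten-detection helper is available (this PR imports a `GlutenTestUtils` elsewhere; the `isGlutenEnabled` method name here is a placeholder):

```scala
// Keep parquet log blocks as the tested default; fall back to avro only
// when running under Gluten/Velox, which doesn't support them yet.
val logBlockFormat = if (GlutenTestUtils.isGlutenEnabled) "avro" else "parquet"
val sparkOpts = Map(
  HoodieWriteConfig.RECORD_MERGE_IMPL_CLASSES.key -> classOf[DefaultSparkRecordMerger].getName,
  HoodieStorageConfig.LOGFILE_DATA_BLOCK_FORMAT.key -> logBlockFormat
)
```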
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMergeModeCommitTimeOrdering.scala:
##########
@@ -28,13 +28,15 @@ import org.apache.hudi.common.testutils.HoodieTestUtils
import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase.validateTableConfig
+object TestMergeModeCommitTimeOrdering {
+ var cnt: Int = 0
+}
+
class TestMergeModeCommitTimeOrdering extends HoodieSparkSqlTestBase {
Seq(
- "cow,current,false,false", "cow,current,false,true",
"cow,current,true,false",
- "mor,current,false,false", "mor,current,false,true",
"mor,current,true,false",
- "cow,6,true,false", "cow,6,true,true", "mor,6,true,true",
- "cow,8,true,false", "cow,8,true,true", "mor,8,true,true").foreach { args =>
+ // "cow,current,false,false", "cow,current,false,true",
"cow,current,true,false",
+ "mor,current,false,false", "mor,current,false,true",
"mor,current,true,false").foreach { args =>
val argList = args.split(',')
val tableType = argList(0)
Review Comment:
🤖 COW table type tests have been commented out (`cow,current,...`), which
removes test coverage for Copy-on-Write tables with commit time ordering. This
is a core Hudi feature. Was this intentional, or leftover from debugging?
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderOnSpark.scala:
##########
@@ -30,32 +31,35 @@ import org.apache.hudi.common.schema.HoodieSchema
import org.apache.hudi.common.table.{HoodieTableConfig, HoodieTableMetaClient}
import org.apache.hudi.common.table.read.TestHoodieFileGroupReaderBase.{hoodieRecordsToIndexedRecords, supportedFileFormats}
import org.apache.hudi.common.table.read.TestHoodieFileGroupReaderOnSpark.getFileCount
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
import org.apache.hudi.common.testutils.{HoodieTestDataGenerator, HoodieTestUtils}
import org.apache.hudi.common.util.{Option => HOption, OrderingValues}
import org.apache.hudi.config.{HoodieCompactionConfig, HoodieWriteConfig}
import org.apache.hudi.storage.{StorageConfiguration, StoragePath}
-import org.apache.hudi.testutils.SparkClientFunctionalTestHarness
+import org.apache.hudi.testutils.{GlutenTestUtils, SparkClientFunctionalTestHarness}
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{HoodieSparkKryoRegistrar, SparkConf}
+import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.{Dataset, HoodieInternalRowUtils, Row, SaveMode, SparkSession}
import org.apache.spark.sql.avro.HoodieSparkSchemaConverters
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.{ArrayData, MapData}
import org.apache.spark.sql.execution.datasources.SparkColumnarFileReader
import org.apache.spark.sql.hudi.MultipleColumnarFileFormatReader
Review Comment:
🤖 The `@Disabled` annotations on tests like `testCustomDelete` use reasons
like "Custom delete payload not supported" — but these features ARE supported
by Hudi. They're only unsupported by Gluten/Velox. If this PR is meant to run
alongside normal Hudi CI, these disabled tests would silently skip important
coverage. Consider using `@EnabledIf` with a condition that checks for Gluten
presence.
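For example, using JUnit 5's `@DisabledIf` so the recorded reason reflects the actual limitation (the condition method is a placeholder that would need to exist on `GlutenTestUtils`):

```scala
@DisabledIf(
  value = "org.apache.hudi.testutils.GlutenTestUtils#isGlutenEnabled",
  disabledReason = "Custom delete payload not yet supported by Gluten/Velox")
@Test
def testCustomDelete(): Unit = {
  // ... unchanged test body ...
}
```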
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-common/src/test/java/org/apache/hudi/common/table/read/TestHoodieFileGroupReaderBase.java:
##########
@@ -113,7 +113,7 @@
* Tests {@link HoodieFileGroupReader} with different engines
*/
public abstract class TestHoodieFileGroupReaderBase<T> {
-  private static final List<HoodieFileFormat> DEFAULT_SUPPORTED_FILE_FORMATS = Arrays.asList(HoodieFileFormat.PARQUET, HoodieFileFormat.ORC);
+  private static final List<HoodieFileFormat> DEFAULT_SUPPORTED_FILE_FORMATS = Arrays.asList(HoodieFileFormat.PARQUET);
Review Comment:
🤖 Removing ORC from `DEFAULT_SUPPORTED_FILE_FORMATS` and commenting out
CUSTOM merge mode tests significantly reduces test coverage for all users, not
just Gluten runs. These changes affect the base test class used by multiple
engines. Could you gate these reductions behind the `gluten.bundle.jar`
property instead of removing them unconditionally?
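A self-contained sketch of the gating idea (a stand-in enum is used since this snippet can't reference `HoodieFileFormat` directly; `gluten.bundle.jar` is the system property this PR's build already uses):

```java
import java.util.Arrays;
import java.util.List;

public class FormatGate {
    // Stand-in for org.apache.hudi.common.model.HoodieFileFormat.
    enum FileFormat { PARQUET, ORC }

    // ORC stays in the default list; it is dropped only when the Gluten
    // bundle property is set on the test JVM.
    static List<FileFormat> supportedFormats() {
        return System.getProperty("gluten.bundle.jar") != null
            ? Arrays.asList(FileFormat.PARQUET)
            : Arrays.asList(FileFormat.PARQUET, FileFormat.ORC);
    }

    public static void main(String[] args) {
        System.out.println(supportedFormats());
    }
}
```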
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/others/TestMORFileSliceLayouts.scala:
##########
@@ -0,0 +1,823 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.hudi.dml.others
+
+import org.apache.hudi.HoodieCLIUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieLogFile}
+import org.apache.hudi.common.model.HoodieRecord.HoodieRecordType
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.table.log.HoodieLogFileReader
+import org.apache.hudi.common.table.log.block.HoodieLogBlock.{HeaderMetadataType, HoodieLogBlockType}
+import org.apache.hudi.common.testutils.HoodieTestUtils
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.testutils.DataSourceTestUtils
+
+import org.apache.hadoop.fs.Path
+import org.apache.spark.sql.SaveMode
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase.getMetaClientAndFileSystemView
+
+import java.util.Collections
+
+import scala.collection.JavaConverters._
+
+/**
+ * Creates 4 MOR tables with specific file slice layouts, validates each layout
+ * via the same HoodieTableFileSystemView + getLatestMergedFileSlicesBeforeOrOn
+ * code path that SELECT uses, then saves gold data alongside each table.
+ */
+class TestMORFileSliceLayouts extends HoodieSparkSqlTestBase {
+
+ val BASE_OUTPUT_PATH = "/home/ubuntu/ws3/hudi-rs/tables/fdss"
+
+ private def cleanPath(path: String): Unit = {
+ val hadoopPath = new Path(path)
+ val fs = hadoopPath.getFileSystem(spark.sessionState.newHadoopConf())
Review Comment:
🤖 This hardcodes an absolute path (`/home/ubuntu/ws3/hudi-rs/tables/fdss`)
that only exists on a specific developer machine. Tests will fail or write to
unexpected locations on any other system. Could you use `withTempDir` or
`getBasePath` instead?
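For instance, with the `withTempDir` helper already available on `HoodieSparkSqlTestBase` (a sketch):

```scala
withTempDir { tmp =>
  // Machine-independent base path; removed automatically after the block.
  val baseOutputPath = s"${tmp.getCanonicalPath}/fdss"
  // ... build the four MOR tables and gold data under baseOutputPath ...
}
```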
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestMORDataSource.scala:
##########
@@ -108,27 +108,18 @@ class TestMORDataSource extends HoodieSparkClientTestBase with SparkDatasetMixin
@CsvSource(Array(
// Inferred as COMMIT_TIME_ORDERING
"AVRO, AVRO, avro, false,,,CURRENT", "AVRO, SPARK, parquet,
false,,,CURRENT",
- "SPARK, AVRO, parquet, false,,,CURRENT", "SPARK, SPARK, parquet,
false,,,CURRENT",
- "AVRO, AVRO, avro, false,,,EIGHT", "AVRO, SPARK, parquet, false,,,EIGHT",
- "SPARK, AVRO, parquet, false,,,EIGHT", "SPARK, SPARK, parquet,
false,,,EIGHT",
- // EVENT_TIME_ORDERING without precombine field
- "AVRO, AVRO, avro, false,,EVENT_TIME_ORDERING,CURRENT", "AVRO, SPARK,
parquet, false,,EVENT_TIME_ORDERING,CURRENT",
- "SPARK, AVRO, parquet, false,,EVENT_TIME_ORDERING,CURRENT", "SPARK, SPARK,
parquet, false,,EVENT_TIME_ORDERING,CURRENT",
- "AVRO, AVRO, avro, false,,EVENT_TIME_ORDERING,EIGHT", "AVRO, SPARK,
parquet, false,,EVENT_TIME_ORDERING,EIGHT",
- "SPARK, AVRO, parquet, false,,EVENT_TIME_ORDERING,EIGHT", "SPARK, SPARK,
parquet, false,,EVENT_TIME_ORDERING,EIGHT",
- // EVENT_TIME_ORDERING with empty precombine field
- "AVRO, AVRO, avro, true,,EVENT_TIME_ORDERING,CURRENT", "AVRO, SPARK,
parquet, true,,EVENT_TIME_ORDERING,CURRENT",
- "SPARK, AVRO, parquet, true,,EVENT_TIME_ORDERING,CURRENT", "SPARK, SPARK,
parquet, true,,EVENT_TIME_ORDERING,CURRENT",
- "AVRO, AVRO, avro, true,,EVENT_TIME_ORDERING,EIGHT", "AVRO, SPARK,
parquet, true,,EVENT_TIME_ORDERING,EIGHT",
- "SPARK, AVRO, parquet, true,,EVENT_TIME_ORDERING,EIGHT", "SPARK, SPARK,
parquet, true,,EVENT_TIME_ORDERING,EIGHT",
- // Inferred as EVENT_TIME_ORDERING
- "AVRO, AVRO, avro, true, timestamp,,CURRENT", "AVRO, SPARK, parquet, true,
timestamp,,CURRENT",
- "SPARK, AVRO, parquet, true, timestamp,,CURRENT", "SPARK, SPARK, parquet,
true, timestamp,,CURRENT",
- "AVRO, AVRO, avro, true, timestamp,,EIGHT", "AVRO, SPARK, parquet, true,
timestamp,,EIGHT",
- "SPARK, AVRO, parquet, true, timestamp,,EIGHT", "SPARK, SPARK, parquet,
true, timestamp,,EIGHT",
+ // "SPARK, AVRO, parquet, false,,,CURRENT", "SPARK, SPARK, parquet,
false,,,CURRENT",
Review Comment:
🤖 A large number of parameterized test cases have been commented out rather
than removed or conditionally skipped. This removes coverage for SPARK record
type, EVENT_TIME_ORDERING, table versions 6 and 8, and more. These are
important code paths. If they're incompatible with Gluten, could you skip them
conditionally (e.g., via `@EnabledIf`) rather than commenting them out?
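Alternatively, a runtime assumption keeps the full matrix visible and reports the skips (the `GlutenTestUtils.isGlutenEnabled` helper name is a placeholder):

```scala
import org.junit.jupiter.api.Assumptions.assumeTrue

// Skip only the Gluten-incompatible combinations at runtime instead of
// deleting them from the matrix; skipped cases still show up in reports.
assumeTrue(!GlutenTestUtils.isGlutenEnabled || recordType == "AVRO",
  s"$recordType record type not yet supported with Gluten/Velox")
```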
<sub><i>- Generated by an AI agent and may contain mistakes. Please verify
any suggestions before applying.</i></sub>
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]