hudi-agent commented on code in PR #18583:
URL: https://github.com/apache/hudi/pull/18583#discussion_r3200756030


##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala:
##########
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.hudi.dml.schema
+
+import org.apache.hudi.blob.BlobTestHelpers
+import org.apache.hudi.common.schema.{HoodieSchema, HoodieSchemaType}
+import org.apache.hudi.testutils.DataSourceTestUtils
+import org.apache.hudi.testutils.HoodieClientTestUtils.createMetaClient
+
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+
+import java.io.File
+
+class TestBlobDataType extends HoodieSparkSqlTestBase {
+
+  private val referenceStructType =
+    "struct<external_path:string, offset:bigint, length:bigint, 
managed:boolean>"
+
+  private def inlineBlobLiteral(hex: String): String =
+    s"""named_struct(
+       |  'type', 'INLINE',
+       |  'data', cast(X'$hex' as binary),
+       |  'reference', cast(null as $referenceStructType)
+       |)""".stripMargin
+
+  private def outOfLineBlobLiteral(path: String, offset: Long, length: Long): String =
+    s"""named_struct(
+       |  'type', 'OUT_OF_LINE',
+       |  'data', cast(null as binary),
+       |  'reference', named_struct(
+       |    'external_path', '$path',
+       |    'offset', cast($offset as bigint),
+       |    'length', cast($length as bigint),
+       |    'managed', false
+       |  )
+       |)""".stripMargin

Review Comment:
   _⚠️ Potential issue_ | _🟡 Minor_
   
   <details>
   <summary>🧩 Analysis chain</summary>
   
   🏁 Script executed:
   
   ```shell
   cd hudi-spark-datasource/hudi-spark && find . -name "TestBlobDataType.scala" -type f
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 136
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cd hudi-spark-datasource/hudi-spark && cat -n src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | head -200
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 10107
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cd hudi-spark-datasource/hudi-spark && wc -l src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 138
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cd hudi-spark-datasource/hudi-spark && grep -n "'\$" src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 243
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cd hudi-spark-datasource/hudi-spark && sed -n '43,53p' src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 476
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cd hudi-spark-datasource/hudi-spark && sed -n '175,183p' src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 285
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cd hudi-spark-datasource/hudi-spark && sed -n '60,75p' src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 548
   
   ---
   
   
   
   </details>
   
   **Escape single quotes in SQL string interpolations.**
   
   Paths interpolated at line 48 (`external_path`), line 66, and line 182 (`location`) can contain single quotes, breaking SQL parsing. For example, a path like `/tmp/test'file/hudi` would produce invalid SQL.
   
   Escape single quotes by doubling them (SQL standard). Add a helper function and apply it to all path interpolations:
   
   <details>
   <summary>Proposed fix</summary>
   
   ```diff
   +  private def sqlStringLiteral(value: String): String = value.replace("'", "''")
   +
      private def outOfLineBlobLiteral(path: String, offset: Long, length: Long): String =
        s"""named_struct(
           |  'type', 'OUT_OF_LINE',
           |  'data', cast(null as binary),
           |  'reference', named_struct(
   -       |    'external_path', '$path',
   +       |    'external_path', '${sqlStringLiteral(path)}',
           |    'offset', cast($offset as bigint),
           |    'length', cast($length as bigint),
           |    'managed', false
           |  )
           |)""".stripMargin
   ```
   
   ```diff
   -           | location '$tablePath'
   +           | location '${sqlStringLiteral(tablePath)}'
   ```
   </details>
   
   Also applies to: 60-67, 176-183
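   
   To see concretely what the doubling buys, here is a minimal, self-contained sketch (the `sqlStringLiteral` name follows the proposed fix above; the path value is hypothetical):
   
   ```scala
   object SqlEscapeSketch extends App {
     // Hypothetical helper mirroring the proposed fix: doubling single quotes is
     // the SQL-standard way to embed a quote inside a single-quoted string
     // literal, and Spark SQL accepts this form.
     def sqlStringLiteral(value: String): String = value.replace("'", "''")
   
     val path = "/tmp/test'file/hudi" // hypothetical path containing a quote
   
     // Unescaped, the literal terminates at the embedded quote and parsing fails:
     //   'external_path', '/tmp/test'file/hudi'
     // Escaped, the parser reads one literal containing a single quote:
     println(s"'external_path', '${sqlStringLiteral(path)}'")
     // prints: 'external_path', '/tmp/test''file/hudi'
   }
   ```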
   
   <details>
   <summary>🤖 Prompt for AI Agents</summary>
   
   ```
   Verify each finding against the current code and only fix it if needed.
   
   In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala` around lines 43 - 53, The SQL string interpolations insert raw paths that may contain single quotes (e.g., in outOfLineBlobLiteral's 'external_path' and other SQL builders that set 'external_path' or 'location'), which will break SQL parsing; add a small helper (e.g., escapeSqlSingleQuotes(s: String): String = s.replace("'", "''")) and use it when embedding paths into SQL string literals so every occurrence (reference in outOfLineBlobLiteral and the other SQL-building functions that interpolate external_path/location) wraps the path with the helper before interpolation.
   ```
   
   </details>
   
   — *CodeRabbit* ([original](https://github.com/hudi-agent/hudi/pull/18#discussion_r3143063277)) (source:comment#3143063277)



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala:
##########
@@ -0,0 +1,283 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.spark.sql.hudi.dml.schema
+
+import org.apache.hudi.blob.BlobTestHelpers
+import org.apache.hudi.common.schema.{HoodieSchema, HoodieSchemaType}
+import org.apache.hudi.testutils.DataSourceTestUtils
+import org.apache.hudi.testutils.HoodieClientTestUtils.createMetaClient
+
+import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
+
+import java.io.File
+
+class TestBlobDataType extends HoodieSparkSqlTestBase {
+
+  private val referenceStructType =
+    "struct<external_path:string, offset:bigint, length:bigint, 
managed:boolean>"
+
+  private def inlineBlobLiteral(hex: String): String =
+    s"""named_struct(
+       |  'type', 'INLINE',
+       |  'data', cast(X'$hex' as binary),
+       |  'reference', cast(null as $referenceStructType)
+       |)""".stripMargin
+
+  private def outOfLineBlobLiteral(path: String, offset: Long, length: Long): String =
+    s"""named_struct(
+       |  'type', 'OUT_OF_LINE',
+       |  'data', cast(null as binary),
+       |  'reference', named_struct(
+       |    'external_path', '$path',
+       |    'offset', cast($offset as bigint),
+       |    'length', cast($length as bigint),
+       |    'managed', false
+       |  )
+       |)""".stripMargin
+
+  test("Test Query Log Only MOR Table With BLOB INLINE column triggers 
compaction") {
+    withRecordType()(withTempDir { tmp =>
+      val tablePath = new File(tmp, "hudi").getCanonicalPath
+      val tableName = generateTableName
+      spark.sql(
+        s"""
+           |create table $tableName (
+           |  id int,
+           |  data blob,
+           |  ts long
+           |) using hudi
+           | location '$tablePath'
+           | tblproperties (
+           |  primaryKey = 'id',
+           |  type = 'mor',
+           |  preCombineField = 'ts',
+           |  hoodie.index.type = 'INMEMORY',
+           |  hoodie.compact.inline = 'true',
+           |  hoodie.clean.commits.retained = '1'
+           | )

Review Comment:
   _⚠️ Potential issue_ | _🟠 Major_
   
   <details>
   <summary>🧩 Analysis chain</summary>
   
   🏁 Script executed:
   
   ```shell
   #!/bin/bash
   set -euo pipefail
   # Verify whether the test pins compaction threshold and inspect config defaults/usage.
   rg -n "hoodie\.compact\.inline(\.max\.delta\.commits)?|clean\.commits\.retained" --type=scala
   rg -n "compact.*max.*delta|INLINE.*DELTA|hoodie\.compact\.inline\.max\.delta\.commits" --type=java --type=scala
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 46665
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cat -n hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | sed -n '50,120p'
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 3425
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cat -n hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | sed -n '210,240p'
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 1662
   
   ---
   
   🏁 Script executed:
   
   ```shell
   rg "INLINE_COMPACT_NUM_DELTA_COMMITS.*defaultValue|default.*INLINE|defaultValue.*hoodie\.compact\.inline\.max" hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java -A 2 -B 2
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 295
   
   ---
   
   🏁 Script executed:
   
   ```shell
   rg "INLINE_COMPACT_NUM_DELTA_COMMITS.*=" 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
 -A 3 | head -15
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 801
   
   ---
   
   🏁 Script executed:
   
   ```shell
   rg "\.defaultValue\(\)" 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java
 -B 5 | grep -A 5 "INLINE_COMPACT_NUM_DELTA_COMMITS"
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 526
   
   ---
   
   🏁 Script executed:
   
   ```shell
   cat -n hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala | sed -n '175,195p'
   ```
   
   Repository: hudi-agent/hudi
   
   Length of output: 841
   
   ---
   
   
   
   </details>
   
   **Explicitly pin the compaction trigger threshold to improve test reliability.**
   
   The tests assume compaction occurs at the 5th delta commit (lines 105 and 224), but rely on the implicit default value of `hoodie.compact.inline.max.delta.commits` (currently 5). While this happens to match, the implicit dependency makes tests brittle to future default changes. Many other tests in the codebase explicitly set this property; both test tables should follow that pattern.
   
   <details>
   <summary>Proposed fix</summary>
   
   ```diff
               | tblproperties (
               |  primaryKey = 'id',
               |  type = 'mor',
               |  preCombineField = 'ts',
               |  hoodie.index.type = 'INMEMORY',
               |  hoodie.compact.inline = 'true',
   +           |  hoodie.compact.inline.max.delta.commits = '5',
               |  hoodie.clean.commits.retained = '1'
               | )
   ```
   </details>
   
   Also applies to: lines 183-190
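   
   Since the threshold appears in two table definitions (and implicitly in the commit-count assertions), one way to apply the fix is to hoist the value into a single constant in the test class. A sketch only, assuming the existing test base class; `compactionTrigger` and `blobTableProps` are hypothetical names, not part of the PR:
   
   ```scala
   import org.apache.spark.sql.hudi.common.HoodieSparkSqlTestBase
   
   class TestBlobDataType extends HoodieSparkSqlTestBase {
   
     // Pinned compaction trigger: the Nth delta commit produces a compaction.
     // Keeping it in one place lets both tblproperties blocks and any
     // commit-count assertions reference the same number.
     private val compactionTrigger = 5
   
     // Shared tblproperties fragment for both test tables (hypothetical helper).
     private def blobTableProps: String =
       s"""| tblproperties (
           |  primaryKey = 'id',
           |  type = 'mor',
           |  preCombineField = 'ts',
           |  hoodie.index.type = 'INMEMORY',
           |  hoodie.compact.inline = 'true',
           |  hoodie.compact.inline.max.delta.commits = '$compactionTrigger',
           |  hoodie.clean.commits.retained = '1'
           | )""".stripMargin
   }
   ```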
   
   <details>
   <summary>🤖 Prompt for AI Agents</summary>
   
   ```
   Verify each finding against the current code and only fix it if needed.
   
   In `@hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestBlobDataType.scala` around lines 67 - 74, The test relies on compaction happening at the 5th delta commit but currently depends on the implicit default; update the table creation tblproperties in TestBlobDataType.scala to explicitly set hoodie.compact.inline.max.delta.commits = '5' for both table definitions (the existing tblproperties block that contains primaryKey, type='mor', preCombineField='ts', hoodie.index.type, hoodie.compact.inline, hoodie.clean.commits.retained) so compaction behavior is deterministic; apply the same explicit property to the second table's tblproperties block referenced around lines 183-190.
   ```
   
   </details>
   
   — *CodeRabbit* ([original](https://github.com/hudi-agent/hudi/pull/18#discussion_r3143063278)) (source:comment#3143063278)


