partition path fields [hudi]

via GitHub Wed, 24 Jun 2026 02:17:44 -0700


hudi-agent commented on code in PR #19056:
URL: https://github.com/apache/hudi/pull/19056#discussion_r3465937469



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSchemaUtils.scala:
##########
@@ -308,4 +310,34 @@ object HoodieSchemaUtils {
         s" tableSchemaFields: $tableSchemaPartitionFields, partitionFields: 
$partitionFields.")
     }
   }
+
+  /**
+   * Checks whether a Spark [[StructField]] represents a Hudi BLOB column. 
BLOB is a logical
+   * type stored as a struct, identified by the 
[[HoodieSchema.TYPE_METADATA_FIELD]] metadata
+   * attached to the field. Applies to both INLINE and OUT_OF_LINE (EXTERNAL) 
blobs.
+   */
+  def isBlobField(field: StructField): Boolean = {
+    val md = field.metadata
+    md != null &&

Review Comment:
   🤖 nit: `StructField.metadata` is typed as `Metadata` in Spark (never null — 
it defaults to `Metadata.empty`), so the `md != null &&` guard is a dead check. 
It may confuse future readers into thinking null metadata is possible here; 
could you drop it and just start with `md.contains(...)`?
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -813,6 +818,31 @@ class HoodieSparkSqlWriterInternal {
     }
   }
 
+  /**
+   * Rejects BLOB-typed columns used as record key, ordering(preCombine), or 
partition path fields.
+   * BLOB columns hold large binary payloads (INLINE or EXTERNAL); using them 
as keys balloons the
+   * record key / shuffle / metadata index, or ties record identity to a 
storage path. Fails fast
+   * with a clear message naming the offending column(s).
+   */
+  private def validateBlobFieldUsage(schema: StructType,
+                                     optParams: Map[String, String],
+                                     resolver: Resolver): Unit = {
+    def check(spec: Option[String], usage: String): Unit = spec.foreach { s =>

Review Comment:
   🤖 nit: the inner function `check` is a bit opaque at the call sites — 
something like `rejectIfBlobField` would make lines 840–843 self-documenting 
without having to look up the implementation.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -813,6 +818,31 @@ class HoodieSparkSqlWriterInternal {
     }
   }
 
+  /**
+   * Rejects BLOB-typed columns used as record key, ordering(preCombine), or 
partition path fields.
+   * BLOB columns hold large binary payloads (INLINE or EXTERNAL); using them 
as keys balloons the
+   * record key / shuffle / metadata index, or ties record identity to a 
storage path. Fails fast
+   * with a clear message naming the offending column(s).
+   */
+  private def validateBlobFieldUsage(schema: StructType,
+                                     optParams: Map[String, String],
+                                     resolver: Resolver): Unit = {
+    def check(spec: Option[String], usage: String): Unit = spec.foreach { s =>
+      val blobs = HoodieSchemaUtils.findBlobFields(schema, s, resolver)
+      if (blobs.nonEmpty) {
+        throw new HoodieException(
+          s"BLOB type column(s) ${blobs.mkString("[", ", ", "]")} cannot be 
used as $usage. " +
+            "BLOB columns hold large binary payloads (INLINE or EXTERNAL) and 
are not supported as " +

Review Comment:
   🤖 nit: the sentence "BLOB columns hold large binary payloads (INLINE or 
EXTERNAL) and are not supported as record key, ordering/preCombine, or 
partition path fields." is copy-pasted into four places across three files 
(`HoodieSparkSqlWriter`, `HoodieCatalogTable`, and twice in 
`HoodieOptionConfig`). Extracting it as a constant in `HoodieSchemaUtils` would 
mean a single edit keeps all messages in sync.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] fix(spark): reject BLOB columns as record key / ordering / partition path fields [hudi]

Reply via email to