voonhous commented on code in PR #18098:
URL: https://github.com/apache/hudi/pull/18098#discussion_r2980406564


##########
hudi-spark-datasource/hudi-spark3.4.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_4ExtendedSqlAstBuilder.scala:
##########
@@ -2689,13 +2692,24 @@ class HoodieSpark3_4ExtendedSqlAstBuilder(conf: SQLConf, delegate: ParserInterfa
       builder.putString("comment", _)
     }
 
+    val dataType = typedVisit[DataType](ctx.dataType)
+
+    addMetadataForType(ctx.dataType(), builder)
+
     StructField(
       name = colName.getText,
-      dataType = typedVisit[DataType](ctx.dataType),
+      dataType = dataType,
       nullable = NULL == null,
       metadata = builder.build())
   }
 
+  private def addMetadataForType(dataType: HoodieSqlBaseParser.DataTypeContext, builder: MetadataBuilder): Unit = {
+    val typeText = dataType.getText
+    if (typeText.equalsIgnoreCase(HoodieSchemaType.BLOB.name())) {
+      builder.putString(HoodieSchema.TYPE_METADATA_FIELD, HoodieSchemaType.BLOB.name())
+    }
+  }
+

Review Comment:
   We should be able to extract this `addMetadataForType` into a shared trait or utility object to reduce the maintenance burden in the future.
   
   Same for the other methods that are repeated across the different Spark versions.
   



##########
hudi-spark-datasource/hudi-spark3.3.x/src/main/scala/org/apache/spark/sql/parser/HoodieSpark3_3ExtendedSqlParser.scala:
##########
@@ -128,7 +128,8 @@ class HoodieSpark3_3ExtendedSqlParser(session: SparkSession, delegate: ParserInt
       normalized.contains("create index") ||
       normalized.contains("drop index") ||
       normalized.contains("show indexes") ||
-      normalized.contains("refresh index")
+      normalized.contains("refresh index") ||
+      normalized.contains(" blob")

Review Comment:
   +1 on my end. We can make this stricter in the future, e.g. `normalized.matches(".*\\bblob\\b(?!\\w).*")`, which matches `blob` only as a standalone token (not as part of a larger word like `blobs`, `blobby`, etc.) in a DDL-like position, or at minimum `normalized.contains(" blob,") || normalized.contains(" blob ") || normalized.contains(" blob\n")` to avoid matching `blob_path`, `blob_count`, and the like.
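   For illustration, here is a minimal comparison of the loose `contains(" blob")` check against the word-boundary regex suggested above (the object and method names are made up for the sketch):

```scala
object BlobTokenCheck {
  // Current loose check: fires on any " blob" occurrence, including inside
  // identifiers such as blob_path.
  def looseMatch(normalized: String): Boolean =
    normalized.contains(" blob")

  // Suggested stricter check: `blob` must appear as a standalone word.
  def strictMatch(normalized: String): Boolean =
    normalized.matches(".*\\bblob\\b(?!\\w).*")

  def main(args: Array[String]): Unit = {
    println(looseMatch("select blob_path from t"))        // true  (false positive)
    println(strictMatch("select blob_path from t"))       // false (`_` is a word char, so \b rejects it)
    println(strictMatch("create table t (payload blob)")) // true
  }
}
```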



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
