Re: [PR] [spark] supports updating blobs through DataEvolution MergeInto [paimon]

via GitHub Wed, 10 Jun 2026 01:46:53 -0700


JingsongLi commented on code in PR #8175:
URL: https://github.com/apache/paimon/pull/8175#discussion_r3386793652



##########
paimon-spark/paimon-spark-common/src/main/scala/org/apache/paimon/spark/commands/MergeIntoPaimonDataEvolutionTable.scala:
##########
@@ -351,19 +352,123 @@ case class MergeIntoPaimonDataEvolutionTable(
       .map { case (_, attrs) => attrs.head }
       .toSeq
 
-    val assignments = metadataColumns.map(column => Assignment(column, column))
-    val output = updateColumnsSorted ++ metadataColumns
+    // Find raw blob update columns and avoid reading them from target table
+    val blobInlineFields = table.coreOptions().blobInlineField().asScala.toSet
+    val rawBlobFieldNames = table
+      .rowType()
+      .getFields
+      .asScala
+      .filter(
+        field =>
+          field.`type`().is(BLOB) &&
+            !blobInlineFields.exists(inlineField => resolver(inlineField, field.name())))
+      .map(_.name())
+      .toSet
+
+    def isRawBlobUpdateColumn(attr: AttributeReference): Boolean = {
+      rawBlobFieldNames.exists(rawBlobFieldName => resolver(rawBlobFieldName, attr.name))
+    }
+
+    // The final output is composed by updated columns, metadata columns and blob marker columns.
+    // Marker columns are used to mark whether a blob field should be written with placeholder
+    val rawBlobUpdateColumns = updateColumnsSorted.filter(isRawBlobUpdateColumn)
+    val rawBlobMarkerNamesByColumn = rawBlobUpdateColumns.zipWithIndex.map {

Review Comment:
   The internal marker column names can collide with real target columns. For 
example, a table can legally have a column named 
`__paimon_raw_blob_placeholder_0`; if a MERGE updates that column and a raw 
BLOB in the same statement, `mergeOutput` will contain two attributes with the 
same name. Then `reorderPartialWriteColumns` selects by quoted name and 
`writePartialFields` resolves the marker with `data.schema.fieldIndex`, so 
Spark can either report an ambiguous reference or bind the user column as the 
boolean marker. Could we generate marker names that are guaranteed not to 
collide with the write columns/source output, or carry the marker attributes 
through by exprId instead of resolving them by name?
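
One possible shape for the first option, as a minimal sketch rather than the PR's actual code: probe each candidate marker name against every name already in play (write columns, source output, earlier markers) and add a suffix until nothing resolves to it. `MarkerNames`, `freshMarkerNames`, the `Resolver` alias and `caseInsensitive` are hypothetical names used only for illustration. The sketch is plain Scala with no Spark dependency, so the resolver is a stand-in for whatever resolver the command already uses.

```scala
object MarkerNames {
  // Hypothetical stand-in for a name resolver: true when two names refer to the same column.
  type Resolver = (String, String) => Boolean

  // Assumption: case-insensitive matching, so names differing only in case still collide.
  val caseInsensitive: Resolver = (a, b) => a.equalsIgnoreCase(b)

  /**
   * Returns one marker name per raw blob column. No returned name resolves to a
   * reserved name (e.g. write columns and source output) or to an earlier marker.
   */
  def freshMarkerNames(
      blobColumns: Seq[String],
      reserved: Seq[String],
      resolver: Resolver,
      prefix: String = "__paimon_raw_blob_placeholder_"): Map[String, String] = {
    // Names that are already taken; grows as markers are assigned.
    val taken = scala.collection.mutable.ArrayBuffer(reserved: _*)
    blobColumns.zipWithIndex.map { case (column, index) =>
      var attempt = 0
      var candidate = s"$prefix$index"
      // Keep appending a numeric suffix until no taken name resolves to the candidate.
      while (taken.exists(resolver(_, candidate))) {
        attempt += 1
        candidate = s"$prefix${index}_$attempt"
      }
      taken += candidate
      column -> candidate
    }.toMap
  }
}
```

With a target column named `__PAIMON_RAW_BLOB_PLACEHOLDER_0` in the reserved list, the first blob column would get a suffixed marker name instead of the colliding one. This only removes the name clash. Carrying the marker attributes through by exprId would avoid name resolution altogether and may be the more robust fix.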



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
