Re: [PR] [HUDI-14363] Strip _hoodie_* meta columns from CDC before/after images [hudi]

via GitHub Wed, 10 Jun 2026 20:28:49 -0700


danny0405 commented on code in PR #18948:
URL: https://github.com/apache/hudi/pull/18948#discussion_r3393065764



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/InternalRowToJsonStringConverter.scala:
##########
@@ -41,7 +43,13 @@ class InternalRowToJsonStringConverter(schema: StructType) {
     val map = scala.collection.mutable.LinkedHashMap.empty[String, Any]
     schema.zipWithIndex.foreach {
       case (field, idx) =>
-        map(field.name) = convertField(record.get(idx, field.dataType), field.dataType)
+        // CDC before/after images must contain only business columns. Records read from base
+        // files or MOR log files carry the _hoodie_* meta columns, while images read from the
+        // supplemental CDC log already have them stripped at write time (HoodieCDCLogger). Skip
+        // the meta columns here so every inference case produces a schema-consistent image.
+        if (!HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(field.name)) {

Review Comment:
   Not sure why the metadata columns are in the CDC payload. Can we fix the
   InternalRow itself instead of filtering here?
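
   One possible reading of this suggestion, as a rough sketch only: compute the
   positions of the business columns once per schema and project each row down to
   them before it reaches the JSON converter, instead of testing every field name
   on every record. The sketch below uses plain Scala collections rather than
   Spark's `InternalRow` and `StructType`. `MetaColumnProjection`,
   `businessIndices`, and `project` are hypothetical names, and the hard-coded
   column set is an assumed stand-in for
   `HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION`, not taken from the PR.

```scala
object MetaColumnProjection {
  // Assumed stand-in for HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.
  val metaColumns: Set[String] = Set(
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
    "_hoodie_operation")

  // Computed once per schema: positions of the columns that are not meta columns.
  def businessIndices(fieldNames: Seq[String]): Seq[Int] =
    fieldNames.zipWithIndex.collect {
      case (name, idx) if !metaColumns.contains(name) => idx
    }

  // Projects a row (modelled here as IndexedSeq[Any]) down to the kept positions.
  def project(row: IndexedSeq[Any], keep: Seq[Int]): IndexedSeq[Any] =
    keep.map(i => row(i)).toIndexedSeq
}
```

   With the projection applied where the row is produced, the converter would
   only ever see business columns, so images from base files, MOR log files, and
   the supplemental CDC log would share one schema without a per-field check.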



