ad1happy2go commented on code in PR #18948:
URL: https://github.com/apache/hudi/pull/18948#discussion_r3404714742


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/InternalRowToJsonStringConverter.scala:
##########
@@ -41,7 +43,13 @@ class InternalRowToJsonStringConverter(schema: StructType) {
     val map = scala.collection.mutable.LinkedHashMap.empty[String, Any]
     schema.zipWithIndex.foreach {
       case (field, idx) =>
-        map(field.name) = convertField(record.get(idx, field.dataType), field.dataType)
+        // CDC before/after images must contain only business columns. Records read from base
+        // files or MOR log files carry the _hoodie_* meta columns, while images read from the
+        // supplemental CDC log already have them stripped at write time (HoodieCDCLogger). Skip
+        // the meta columns here so every inference case produces a schema-consistent image.
+        if (!HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(field.name)) {

Review Comment:
   Good call. I moved the fix off the JSON converter and onto the row itself.
   
   The meta columns are present on the `InternalRow` because the before/after image records are read through the file group reader. That reader keeps `_hoodie_record_key` and the ordering fields on the row because the MOR merge/dedup path needs them internally, so they cannot be dropped at read time without changing the merge logic.
   
   Instead, `CDCFileGroupIterator` now projects each record onto the meta-stripped image schema (`HoodieSchemaUtils.removeMetadataFields`) via the engine-native `RecordContext.projectRecord(...)`, right at the image-materialization point. The projection and the converter are cached per `schemaId` so that schema evolution is handled without rebuilding them for every record. The converter is reverted to its original schema-agnostic form. This is the same pattern `HoodieFileGroupReader` uses (`requiredSchema` for merging, `requestedSchema` for output).
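
   To illustrate the idea only, here is a minimal sketch in plain Scala; it is not the code in this PR. `Schema`, `Row`, `ImageProjector`, and `ImageProjectionSketch` are hypothetical stand-ins for the Hudi schema, `InternalRow`, and the projection returned by `RecordContext.projectRecord(...)`, and the meta column set is assumed to be the five `_hoodie_*` columns plus `_hoodie_operation`. It shows a projection that drops the meta columns, built once per schema id and reused for every later row with that schema.

```scala
import scala.collection.mutable

object ImageProjectionSketch {
  // Assumed meta column names: the five standard ones plus the operation column.
  val MetaColumns: Set[String] = Set(
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
    "_hoodie_operation")

  // Hypothetical stand-ins: a schema is an id plus ordered field names,
  // and a row is a positional vector of values.
  final case class Schema(id: Int, fields: Vector[String])
  type Row = Vector[Any]

  final class ImageProjector {
    // One projection per schema id, so an evolved schema gets its own entry.
    private val cache = mutable.Map.empty[Int, Row => Row]

    // Number of distinct schemas for which a projection has been built.
    def cachedSchemas: Int = cache.size

    // Field names of the meta-stripped image schema.
    def imageFields(schema: Schema): Vector[String] =
      schema.fields.filterNot(MetaColumns.contains)

    // Project a full row (meta + business columns) onto the image schema.
    def project(schema: Schema, row: Row): Row =
      cache.getOrElseUpdate(schema.id, build(schema))(row)

    // Precompute the positions of the business columns once per schema.
    private def build(schema: Schema): Row => Row = {
      val keep = schema.fields.zipWithIndex.collect {
        case (name, idx) if !MetaColumns.contains(name) => idx
      }
      row => keep.map(i => row(i))
    }
  }
}
```

   The merge path still sees the full row; only the materialized image goes through `project`, which mirrors the required-versus-requested schema split described above.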
   
   Verified locally in spark-shell across COW and MOR tables, all three supplemental logging modes, with and without inline compaction, exercising insert, update, and delete. Before the fix, every variant leaked the 5 `_hoodie_*` columns into the after image; after the fix, all images are clean. Pushed as `8f2c23e`.



