ad1happy2go commented on code in PR #18948:
URL: https://github.com/apache/hudi/pull/18948#discussion_r3404714742
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/cdc/InternalRowToJsonStringConverter.scala:
##########
@@ -41,7 +43,13 @@ class InternalRowToJsonStringConverter(schema: StructType) {
val map = scala.collection.mutable.LinkedHashMap.empty[String, Any]
schema.zipWithIndex.foreach {
case (field, idx) =>
- map(field.name) = convertField(record.get(idx, field.dataType), field.dataType)
+ // CDC before/after images must contain only business columns. Records read from base
+ // files or MOR log files carry the _hoodie_* meta columns, while images read from the
+ // supplemental CDC log already have them stripped at write time (HoodieCDCLogger). Skip
+ // the meta columns here so every inference case produces a schema-consistent image.
+ if (!HoodieRecord.HOODIE_META_COLUMNS_WITH_OPERATION.contains(field.name)) {
Review Comment:
Good call. I moved the fix off the JSON converter and onto the row itself.
The meta columns are on the `InternalRow` because the before/after image
records are read through the file group reader, which keeps
`_hoodie_record_key` and the ordering fields on the row since the MOR
merge/dedup path needs them internally. So they cannot be dropped at read time
without changing the merge logic.
Instead, `CDCFileGroupIterator` now projects each record onto the
meta-stripped image schema (`HoodieSchemaUtils.removeMetadataFields`) via the
engine-native `RecordContext.projectRecord(...)` right at the
image-materialization point. The projection and converter are cached per
`schemaId` so that schema evolution is handled. The converter is reverted to
its original schema-agnostic form. This is the same pattern
`HoodieFileGroupReader` uses (requiredSchema for merging, requestedSchema for
output).
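For readers following along, here is a minimal sketch of the projection-plus-cache idea. It is not the code in the PR: it uses plain Scala collections in place of `InternalRow`, the Hudi schema types, and `RecordContext.projectRecord(...)`. The names `ImageProjectionSketch`, `Projection`, `projectionFor`, and `toImage` are hypothetical, and the meta column set is hard-coded as a stand-in for what `HoodieSchemaUtils.removeMetadataFields` strips.

```scala
import scala.collection.mutable

object ImageProjectionSketch {
  // Hard-coded stand-in (assumption) for the meta columns the real code strips.
  val MetaColumns: Set[String] = Set(
    "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
    "_hoodie_partition_path", "_hoodie_file_name", "_hoodie_operation")

  // Simplified models: a schema is an ordered list of field names,
  // a row is a positional list of values.
  type Schema = Vector[String]
  type Row = Vector[Any]

  // Hypothetical projection: the business column names plus the positions
  // of those columns in the full (merge-time) row.
  final case class Projection(imageSchema: Schema, keptIndexes: Vector[Int]) {
    def apply(row: Row): Row = keptIndexes.map(i => row(i))
  }

  // One projection per schemaId, so an evolved schema gets its own entry.
  // Assumes a schemaId always maps to the same schema; not thread-safe.
  private val cache = mutable.Map.empty[Int, Projection]

  def projectionFor(schemaId: Int, schema: Schema): Projection =
    cache.getOrElseUpdate(schemaId, {
      val kept = schema.zipWithIndex.filterNot { case (name, _) =>
        MetaColumns.contains(name)
      }
      Projection(kept.map(_._1), kept.map(_._2))
    })

  // Materialize an image: the full row stays intact for merging, and only
  // the projected copy is handed to the (schema-agnostic) converter.
  def toImage(schemaId: Int, schema: Schema, row: Row): Vector[(String, Any)] = {
    val projection = projectionFor(schemaId, schema)
    projection.imageSchema.zip(projection(row))
  }
}
```

With a schema of `_hoodie_commit_time`, `_hoodie_record_key`, `id`, `price`, calling `ImageProjectionSketch.toImage(0, schema, row)` yields only the `id` and `price` pairs. A second call with the same `schemaId` reuses the cached projection instead of rebuilding it.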
Verified locally via spark-shell across COW + MOR, all three supplemental
logging modes, with and without inline compaction (insert/update/delete).
Before the fix, every variant leaked the 5 `_hoodie_*` columns into the after
image; after the fix, all are clean. Pushed as `8f2c23e`.