voonhous commented on code in PR #18938: URL: https://github.com/apache/hudi/pull/18938#discussion_r3463846891
########## hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/hadoop/HoodieVariantReconstruction.java: ########## Review Comment: Good catch, this found a bug. Added `TestHoodieVariantReconstructionRoundTrip` (in hudi-spark4-common, so the real provider is auto-detected): a `create` -> `reconstruct` round-trip over a record with a shredded variant column plus a non-variant column, asserting the variant is rebuilt, the non-variant column passes through, and the field positions line up. It failed on first run: `create` reused the requested field instances for non-target columns when assembling the intermediate read schema, and those Avro `Field`s are already bound to the requested record - so `Schema.setFields` throws `Field already used` as soon as a non-variant column sits alongside a shredded variant (i.e. every real table). The target branch only avoided it because `withSchema()` builds a fresh `Field`. Fixed by doing the same for non-target fields. Both in 50ab440. It stayed latent because nothing exercised this path: Spark compaction reads base files via the InternalRow reader, and the Java/Flink AVRO path fails fast first (no provider). ########## hudi-hadoop-common/src/main/java/org/apache/hudi/io/storage/hadoop/HoodieVariantReconstruction.java: ########## Review Comment: Good catch, this found a bug. Added `TestHoodieVariantReconstructionRoundTrip` (in hudi-spark4-common, so the real provider is auto-detected): a `create` -> `reconstruct` round-trip over a record with a shredded variant column plus a non-variant column, asserting the variant is rebuilt, the non-variant column passes through, and the field positions line up. It failed on first run: `create` reused the requested field instances for non-target columns when assembling the intermediate read schema, and those Avro `Field`s are already bound to the requested record - so `Schema.setFields` throws `Field already used` as soon as a non-variant column sits alongside a shredded variant (i.e. every real table). The target branch only avoided it because `withSchema()` builds a fresh `Field`. Fixed by doing the same for non-target fields. It stayed latent because nothing exercised this path: Spark compaction reads base files via the InternalRow reader, and the Java/Flink AVRO path fails fast first (no provider). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
