voonhous commented on issue #17938:
URL: https://github.com/apache/hudi/issues/17938#issuecomment-3770468378

   # Summary of root cause
   Found the root cause:
   
   `StreamSync` (used by DeltaStreamer, which IIRC has been renamed to 
HoodieStreamer in the current repo) does the following:
   
   
https://github.com/apache/hudi/blob/a55bc00f8e7097a18bd6ecb82470e6576e4edaf0/hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java#L508-L533
   
   1. readFromSource
       - Uses `MercifulJsonConverter` which converts Strings to 
`java.lang.String`
   2. writeToSinkAndDoMetaSync
       - Does required merging using `BufferedRecordMergerFactory`
       - Commits data
       - Performs metasync after commit is successful
   
   `MercifulJsonConverter` code that returns `java.lang.String`:
   
https://github.com/apache/hudi/blob/a55bc00f8e7097a18bd6ecb82470e6576e4edaf0/hudi-common/src/main/java/org/apache/hudi/avro/MercifulJsonConverter.java#L441-L456
   
   Line 447 returns the value as a `java.lang.String`. 
   Hence, for incoming records produced by `readFromSource`, the String 
columns (including the **orderingVal** column) are `java.lang.String`. 
   
   Records read from the base file, however, have their String columns read 
out as Avro's `Utf8`. Merging therefore mixes `java.lang.String` and `Utf8` 
values for the same column, which triggers the merge error.
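
   A minimal sketch of how such a type mismatch can surface. `Utf8Like` is a 
hypothetical stand-in for Avro's `Utf8` so the example runs without Avro on the 
classpath, and comparing through the raw `Comparable` interface is an 
assumption about the merge path, not the actual Hudi code. The exact error in 
the issue may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class OrderingValueMismatch {

    /** Hypothetical stand-in for Avro's Utf8: byte-backed, comparable only to its own type. */
    static final class Utf8Like implements Comparable<Utf8Like> {
        private final byte[] bytes;

        Utf8Like(String value) {
            this.bytes = value.getBytes(StandardCharsets.UTF_8);
        }

        @Override
        public int compareTo(Utf8Like other) {
            return Arrays.compareUnsigned(bytes, other.bytes);
        }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    /** Compares two ordering values through the raw Comparable interface (assumed merge behaviour). */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareOrderingValues(Comparable incoming, Comparable existing) {
        return incoming.compareTo(existing);
    }

    /** Returns true when the values cannot be compared because their runtime types differ. */
    static boolean comparisonFails(Comparable<?> incoming, Comparable<?> existing) {
        try {
            compareOrderingValues(incoming, existing);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Comparable<?> fromJsonSource = "2024-01-01";              // java.lang.String
        Comparable<?> fromBaseFile = new Utf8Like("2024-01-01");  // Utf8-like value
        System.out.println(comparisonFails(fromJsonSource, fromBaseFile));
        System.out.println(comparisonFails(new Utf8Like("a"), new Utf8Like("b")));
    }
}
```

   The mixed pair fails because each class's `compareTo` only accepts its own 
type, even though both values hold the same text.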
   
   TL;DR of the root cause: `MercifulJsonConverter.StringProcessor` returns 
`java.lang.String` directly instead of wrapping it in `Utf8`. This is 
inconsistent with Avro's standard behavior, where strings are `Utf8` by 
default.
   
   
   # Dataflow references
   1. JSON sources -> `SourceFormatAdapter.fetchNewDataInAvroFormat()` -> 
`AvroConvertor.fromJson()` -> `MercifulJsonConverter.convert()` -> 
`java.lang.String` (via `StringProcessor`, which does NOT convert to `Utf8`)
   2. ROW sources -> `HoodieSparkUtils.createRdd()` -> `AvroSerializer` -> 
`Utf8` (Spark's `AvroSerializer`: 
`new Utf8(getter.getUTF8String(ordinal).getBytes)`)
   3. AVRO sources -> returns directly what the source provides (typically 
`Utf8` when reading standard Avro files)
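
   Since the three paths above can yield different string representations, one 
possible mitigation is to normalize string-like values to a single type before 
comparing them. The sketch below is illustrative only: `OrderingValueNormalizer` 
and its methods are hypothetical names, not Hudi APIs, and this is not 
necessarily the fix adopted for this issue. It relies on the fact that both 
`java.lang.String` and Avro's `Utf8` implement `CharSequence`.

```java
public class OrderingValueNormalizer {

    /** Converts any CharSequence (String, Utf8, ...) to java.lang.String; other values pass through. */
    static Object normalize(Object value) {
        return (value instanceof CharSequence) ? value.toString() : value;
    }

    /** Compares two ordering values after normalizing string-like values to java.lang.String. */
    @SuppressWarnings({"unchecked", "rawtypes"})
    static int compareNormalized(Object left, Object right) {
        Comparable l = (Comparable) normalize(left);
        Comparable r = (Comparable) normalize(right);
        return l.compareTo(r);
    }

    public static void main(String[] args) {
        // StringBuilder stands in for any non-String CharSequence such as Utf8.
        System.out.println(compareNormalized("b", new StringBuilder("a")));
    }
}
```

   Note that `String` ordering is by UTF-16 code unit while `Utf8` ordering is 
by byte, so the two can disagree for some non-BMP characters. Wrapping the 
value in `Utf8` at the converter, as the root cause suggests, avoids that 
difference.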
   

