voonhous commented on code in PR #17953:
URL: https://github.com/apache/hudi/pull/17953#discussion_r2711072298


##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/BufferedRecordMergerFactory.java:
##########
@@ -249,7 +249,7 @@ public BufferedRecord<T> finalMerge(BufferedRecord<T> 
olderRecord, BufferedRecor
       Comparable oldOrderingValue = olderRecord.getOrderingValue();
       HoodieSchema newSchema = 
recordContext.getSchemaFromBufferRecord(newerRecord);
       if (!olderRecord.isCommitTimeOrderingDelete()
-          && oldOrderingValue.compareTo(newOrderingValue) > 0) {
+          && OrderingValues.compare(oldOrderingValue, newOrderingValue) > 0) {

Review Comment:
   I have updated the issue, it'll be good if you can read through my detailed 
analysis of:
   
   1. **WHAT** is the ROOT CAUSE of the issue
   2. **HOW** to trigger the issue and i have even included 2 test cases, 1 
summarised, and another with an E2E flow
       - I have also added within this test 16 test cases, all of which covers 
_tableType{COW,MOR}, transformer{true,false}, HoodieRecordType{SPARK,AVRO}_
       - Using this, we can see how the types differs with each and every of 
the use cases/entrypoints
   3. **WHERE** and **WHEN**, i.e. the code path that follows which will 
trigger the issue
   
   While i do agree that what we have looks hacky, it'll be extremely helpful 
if you can at least take a look at my analysis and help to provide insights as 
to why:
   
   1. HoodieRecordType.AVRO with Json sources are generating Strings for the 
problematic flow
   2. If Strings are intended, then the older records should be read out in 
String type too, instead of `Avro.UTF8`
   3. If Strings are not intended, then we should convert them to `Avro.UTF8`, 
but this will cause other separate issues when i tried doing that.
   
   As can be seen, I have tried:
   1. Fixing the issue at source, i.e. Since Side_A is not in the correct 
format, i tried changing that side to what is expected to make things work, but 
this did not work
   2. Fixing the issue closer to the source, i.e. I Change the format of 
Side_B, and that did not work too as i realised that will cause too many 
cascading changes
   3. When fixing the issue at source Side_A and Side_B did not work, i 
resorted to normalizing, which I felt is the correct choice at this point in 
time. 
   
   As mentioned, i do agree that this is hacky, but i feel what i am doing is 
justified FOR NOW. Until someone provides better information AFTER reading my 
analysis and point out what i have missed, my fix will stand for now until i 
have more time to revisit this issue.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to