voonhous commented on code in PR #17953:
URL: https://github.com/apache/hudi/pull/17953#discussion_r2711072298
##########
hudi-common/src/main/java/org/apache/hudi/common/table/read/BufferedRecordMergerFactory.java:
##########
@@ -249,7 +249,7 @@ public BufferedRecord<T> finalMerge(BufferedRecord<T> olderRecord, BufferedRecor
Comparable oldOrderingValue = olderRecord.getOrderingValue();
HoodieSchema newSchema = recordContext.getSchemaFromBufferRecord(newerRecord);
if (!olderRecord.isCommitTimeOrderingDelete()
- && oldOrderingValue.compareTo(newOrderingValue) > 0) {
+ && OrderingValues.compare(oldOrderingValue, newOrderingValue) > 0) {
Review Comment:
I have updated the issue; it would be good if you could read through my detailed
analysis of:
1. **WHAT** the ROOT CAUSE of the issue is
2. **HOW** to trigger the issue. I have included 2 test cases: one
summarised, and another with an E2E flow
    - Within this test I have also added 16 test cases, which cover
_tableType{COW,MOR}, transformer{true,false}, HoodieRecordType{SPARK,AVRO}_
    - Using this, we can see how the types differ across each of the
use cases/entrypoints
3. **WHERE** and **WHEN**, i.e. the code path that will
trigger the issue
While I do agree that what we have looks hacky, it would be extremely helpful
if you could at least take a look at my analysis and provide insights into
why:
1. HoodieRecordType.AVRO with JSON sources is generating Strings for the
problematic flow
2. If Strings are intended, then the older records should be read out as
Strings too, instead of as `Avro.UTF8`
3. If Strings are not intended, then we should convert them to `Avro.UTF8`,
but that caused other, separate issues when I tried it.
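To make the type-mismatch problem concrete, here is a minimal, self-contained sketch (not Hudi code; class and method names are illustrative) of why a raw `Comparable.compareTo` between mismatched runtime types blows up at runtime, which is the kind of failure a type-aware helper like `OrderingValues.compare` is meant to guard against:

```java
// Hypothetical sketch: raw Comparable lets mismatched types compile,
// but compareTo then throws ClassCastException at runtime.
public class RawCompareSketch {

    // Returns true if compareTo threw a ClassCastException for this pair.
    @SuppressWarnings({"unchecked", "rawtypes"})
    public static boolean rawCompareFails(Comparable older, Comparable newer) {
        try {
            older.compareTo(newer);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // e.g. the older ordering value read back as one type,
        // the newer one produced by a different code path as another.
        System.out.println(rawCompareFails(5, "5"));   // mismatched: Integer vs String
        System.out.println(rawCompareFails("a", "b")); // same type: compares fine
    }
}
```

The compiler cannot catch this because both sides are typed as the raw `Comparable`, so the failure only surfaces when the two code paths actually produce different runtime types.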
As can be seen, I have tried:
1. Fixing the issue at the source, i.e. since Side_A is not in the correct
format, I tried changing that side to the expected format, but this did
not work
2. Fixing the issue closer to the source, i.e. changing the format of
Side_B, but that did not work either, as I realised it would cause too many
cascading changes
3. When fixing the issue at the source (Side_A and Side_B) did not work, I
resorted to normalizing, which I feel is the correct choice at this point in
time.
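As a rough illustration of approach 3 (normalizing before comparing), here is a hypothetical sketch. It relies only on the fact that both `String` and Avro's `Utf8` implement `CharSequence`; the helper name is made up and is not the actual `OrderingValues` API. `StringBuilder` stands in for `Utf8` below so the example stays dependency-free:

```java
// Hypothetical sketch of "normalize before compare": if both ordering
// values are CharSequence-like (e.g. String vs Avro Utf8), compare their
// String forms instead of calling compareTo across mismatched types.
public class NormalizedCompareSketch {

    @SuppressWarnings({"unchecked", "rawtypes"})
    public static int compareNormalized(Comparable a, Comparable b) {
        if (a instanceof CharSequence && b instanceof CharSequence) {
            // toString() normalizes both sides to String before comparing.
            return a.toString().compareTo(b.toString());
        }
        // Same-type values still take the direct path.
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        // StringBuilder stands in for Avro's Utf8 (a CharSequence that is
        // Comparable but not a String); a raw compareTo here would throw.
        System.out.println(compareNormalized("apple", new StringBuilder("banana")) < 0);
        System.out.println(compareNormalized("x", "x") == 0);
    }
}
```

This keeps the fix localized to the comparison site rather than forcing a type change on either the reader or the writer path, which is why I feel it causes the fewest cascading changes.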
As mentioned, I do agree that this is hacky, but I feel it is justified
for now. Until someone reads my analysis and points out what I have missed,
this fix will stand until I have more time to revisit the issue.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]