danny0405 commented on code in PR #14090:
URL: https://github.com/apache/hudi/pull/14090#discussion_r2446553750
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java:
##########
@@ -199,6 +197,13 @@ public static <T> HoodieData<HoodieRecord> convertWriteStatsToSecondaryIndexReco
});
return records.iterator();
});
+
+ // Deduplicate secondary index records by grouping by the secondary index key
+ // (secondaryKey$recordKey). This handles the case where a record moves from one file group to
+ // another (partition path update), which generates both a delete (from old fileId) and an
+ // insert (to new fileId). Similar to how Record Level Index handles partition path update,
+ // we prefer non-deleted records.
+ return HoodieTableMetadataUtil.reduceByKeys(secondaryIndexRecords, parallelism, false);
Review Comment:
> Similar to how Record Level Index handles partition path update
RLI checks the `HoodieRecord.ignoreIndexUpdate` flag and simply ignores the
delete records. Does SI have a similar way to handle these updates? The
benefit is that the order of the updates becomes irrelevant and the result is
deterministic, compared to the preference-based merge you mentioned here.
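
To make the two options concrete, here is a minimal, self-contained sketch. It does not use any Hudi classes: `IndexRecord`, `SiDedupSketch`, `reducePreferNonDeleted`, and `dropIgnoredThenCollect` are hypothetical names, and the `ignoreIndexUpdate` field only mimics the flag mentioned above. It assumes a partition path update produces one delete (old fileId) and one insert (new fileId) under the same secondary index key.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a secondary index record; not a Hudi class.
final class IndexRecord {
  final String key;                // e.g. "secondaryKey$recordKey"
  final String fileId;
  final boolean deleted;
  final boolean ignoreIndexUpdate; // mimics the flag discussed above

  IndexRecord(String key, String fileId, boolean deleted, boolean ignoreIndexUpdate) {
    this.key = key;
    this.fileId = fileId;
    this.deleted = deleted;
    this.ignoreIndexUpdate = ignoreIndexUpdate;
  }
}

final class SiDedupSketch {
  // Option A: reduce by key and prefer the non-deleted record when two collide.
  static Map<String, IndexRecord> reducePreferNonDeleted(List<IndexRecord> records) {
    Map<String, IndexRecord> out = new LinkedHashMap<>();
    for (IndexRecord r : records) {
      // Keep the existing record unless it is a delete; then take the newcomer.
      out.merge(r.key, r, (existing, incoming) -> existing.deleted ? incoming : existing);
    }
    return out;
  }

  // Option B: drop records flagged upstream, so no merge preference is needed
  // for the delete/insert pair.
  static Map<String, IndexRecord> dropIgnoredThenCollect(List<IndexRecord> records) {
    Map<String, IndexRecord> out = new LinkedHashMap<>();
    for (IndexRecord r : records) {
      if (!r.ignoreIndexUpdate) {
        out.put(r.key, r);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    IndexRecord delete = new IndexRecord("sk$rk", "file-old", true, true);
    IndexRecord insert = new IndexRecord("sk$rk", "file-new", false, false);

    // Try both arrival orders to check that the outcome does not depend on them.
    List<List<IndexRecord>> orders = Arrays.asList(
        Arrays.asList(delete, insert),
        Arrays.asList(insert, delete));

    for (List<IndexRecord> records : orders) {
      System.out.println("A: " + reducePreferNonDeleted(records).get("sk$rk").fileId);
      System.out.println("B: " + dropIgnoredThenCollect(records).get("sk$rk").fileId);
    }
  }
}
```

In this toy case both options keep the record pointing at the new file ID for either arrival order. The difference is where the decision lives: option A encodes a preference inside the merge function, while option B relies on a flag set when the delete is generated, so the reduce step never sees the conflicting pair.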