nsivabalan commented on code in PR #14090:
URL: https://github.com/apache/hudi/pull/14090#discussion_r2452410026
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java:
##########
@@ -199,6 +197,13 @@ public static <T> HoodieData<HoodieRecord> convertWriteStatsToSecondaryIndexReco
      });
      return records.iterator();
    });
+
+    // Deduplicate secondary index records by grouping by the secondary index key
+    // (secondaryKey$recordKey). This handles the case where a record moves from one file group to
+    // another (partition path update), which generates both a delete (from old fileId) and an
+    // insert (to new fileId). Similar to how Record Level Index handles partition path update,
+    // we prefer non-deleted records.
+    return HoodieTableMetadataUtil.reduceByKeys(secondaryIndexRecords, parallelism, false);
Review Comment:
Hey @danny0405: I attempted this, at least for streaming writes, but hit some snags.
The tricky part is that the secondary column value could also be changing while the record
moves from one partition to another.
So, unless we have both versions of the record (the previous version of the record and the
new record that goes into the new partition), we can't do much from within the write handle
alone.
So, I can't think of any other way apart from adding the additional reduce stage for
streaming writes. For non-streaming writes, I guess we can't do much either and have to
add the reduce stage anyway.
CC @yihua
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]