nsivabalan commented on code in PR #14090:
URL: https://github.com/apache/hudi/pull/14090#discussion_r2452410026


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java:
##########
@@ -199,6 +197,13 @@ public static <T> HoodieData<HoodieRecord> convertWriteStatsToSecondaryIndexReco
       });
       return records.iterator();
     });
+
+    // Deduplicate secondary index records by grouping by the secondary index key
+    // (secondaryKey$recordKey). This handles the case where a record moves from one
+    // file group to another (partition path update), which generates both a delete
+    // (from the old fileId) and an insert (to the new fileId). Similar to how the
+    // Record Level Index handles partition path updates, we prefer non-deleted records.
+    return HoodieTableMetadataUtil.reduceByKeys(secondaryIndexRecords, parallelism, false);

Review Comment:
   Hey @danny0405: I attempted this, at least for streaming writes, but hit some snags.
   The tricky part is that the secondary column value could also be changing while the record moves from one partition to another.
   So, unless we have both versions of the record (the previous version and the new record that goes into the new partition), we can't do much from within the write handle alone.
   
   So I can't think of any option other than adding the additional reduce stage for streaming writes. For non-streaming writes, I guess we can't do much and have to add the reduce stage anyway.
   CC @yihua
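   
   The reduce stage described above can be sketched as a plain group-by-key merge that prefers the non-deleted record. This is a minimal stand-alone illustration, not the actual Hudi implementation: `SIRecord`, its `deleted` flag, and the `reduceByKeys` helper below are hypothetical simplifications of the real `HoodieRecord`/`HoodieTableMetadataUtil.reduceByKeys` machinery.
   
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.stream.Collectors;
   
   public class SecondaryIndexDedupSketch {
       // Hypothetical stand-in for a secondary index record: the key is the
       // combined "secondaryKey$recordKey", plus a deleted flag.
       static final class SIRecord {
           final String key;
           final boolean deleted;
           SIRecord(String key, boolean deleted) { this.key = key; this.deleted = deleted; }
       }
   
       // Group records by secondary index key; when a delete (old file group)
       // and an insert (new file group) collide on the same key, keep the
       // non-deleted record, mirroring the reduce stage in the diff above.
       static Map<String, SIRecord> reduceByKeys(List<SIRecord> records) {
           return records.stream().collect(Collectors.toMap(
               r -> r.key,
               r -> r,
               (a, b) -> a.deleted ? b : a)); // prefer the non-deleted record
       }
   
       public static void main(String[] args) {
           // A record moved partitions: a delete from the old file group and an
           // insert into the new one arrive with the same secondary index key.
           List<SIRecord> records = List.of(
               new SIRecord("sk1$rk1", true),   // delete from old fileId
               new SIRecord("sk1$rk1", false),  // insert into new fileId
               new SIRecord("sk2$rk2", false));
           Map<String, SIRecord> deduped = reduceByKeys(records);
           System.out.println(deduped.size() + " " + deduped.get("sk1$rk1").deleted);
       }
   }
   ```
   
   Note this only works when both versions reach the same reduce stage; as the comment says, it cannot be done inside a single write handle, which sees only one side of the move.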



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to