nsivabalan commented on code in PR #14090:
URL: https://github.com/apache/hudi/pull/14090#discussion_r2452187707


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java:
##########
@@ -199,6 +197,13 @@ public static <T> HoodieData<HoodieRecord> convertWriteStatsToSecondaryIndexReco
      });
      return records.iterator();
    });
+
+    // Deduplicate secondary index records by grouping by the secondary index key
+    // (secondaryKey$recordKey). This handles the case where a record moves from one file group to
+    // another (partition path update), which generates both a delete (from old fileId) and an
+    // insert (to new fileId). Similar to how Record Level Index handles partition path update,
+    // we prefer non-deleted records.
+    return HoodieTableMetadataUtil.reduceByKeys(secondaryIndexRecords, parallelism, false);

Review Comment:
   It may not line up as easily for SI as it did for RLI.
   Consider the append handle: for SI record generation, we take the entire file slice (base file plus log files prior to the current commit) and compute a record key -> SI value mapping. We then take the file slice including the new log files added as part of the current commit and compute a second record key -> SI value mapping.
   
   We then diff the two mappings to compute the SI records for the MDT partition.
   One thing we could do is keep hold of the set of record keys where `ignoreIndexUpdate` is true, and then remove them from the final set of records we compute for SI.
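   The diff-plus-filter flow described above can be sketched roughly as follows. Note this is a minimal illustration with plain `Map`/`Set` types; the class name `SecondaryIndexDiffSketch`, the method `computeSiUpdates`, and the `ignoreIndexUpdateKeys` parameter are hypothetical stand-ins for Hudi's actual `HoodieData`-based SI record generation:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative sketch (not Hudi code): diff two record-key -> SI-value mappings
// (before vs. after the current commit's log files) and drop record keys whose
// index update should be ignored (ignoreIndexUpdate == true).
public class SecondaryIndexDiffSketch {

  // Returns the record key -> SI value entries that are new or changed in
  // "after" relative to "before", minus the keys flagged to be ignored.
  public static Map<String, String> computeSiUpdates(
      Map<String, String> before,
      Map<String, String> after,
      Set<String> ignoreIndexUpdateKeys) {
    Map<String, String> updates = new HashMap<>();
    for (Map.Entry<String, String> e : after.entrySet()) {
      // Emit only entries whose SI value differs from the pre-commit mapping.
      if (!e.getValue().equals(before.get(e.getKey()))) {
        updates.put(e.getKey(), e.getValue());
      }
    }
    // Remove record keys where ignoreIndexUpdate is true, as suggested above.
    updates.keySet().removeAll(ignoreIndexUpdateKeys);
    return updates;
  }
}
```

   (A full implementation would also emit delete records for keys present in the old mapping but absent from the new one; that is omitted here for brevity.)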



