Re: [PR] fix: Handle deletes and updates properly in secondary index [hudi]

via GitHub Mon, 20 Oct 2025 19:19:18 -0700


danny0405 commented on code in PR #14090:
URL: https://github.com/apache/hudi/pull/14090#discussion_r2446553750



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/SecondaryIndexRecordGenerationUtils.java:
##########
@@ -199,6 +197,13 @@ public static <T> HoodieData<HoodieRecord> convertWriteStatsToSecondaryIndexReco
       });
       return records.iterator();
     });
+
+    // Deduplicate secondary index records by grouping by the secondary index key
+    // (secondaryKey$recordKey). This handles the case where a record moves from one file group to
+    // another (partition path update), which generates both a delete (from old fileId) and an
+    // insert (to new fileId). Similar to how Record Level Index handles partition path update,
+    // we prefer non-deleted records.
+    return HoodieTableMetadataUtil.reduceByKeys(secondaryIndexRecords, parallelism, false);

Review Comment:
   > Similar to how Record Level Index handles partition path update
   
   RLI checks the `HoodieRecord.ignoreIndexUpdate` flag and simply ignores the 
delete records. Does SI have the same way to handle these updates? The benefit 
is that the order of the updates becomes irrelevant and the result is 
deterministic, compared to the preference you mentioned here.
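
To make the two options concrete, here is a minimal, self-contained sketch. It does not use Hudi's API: `IndexRecord`, `reducePreferNonDeleted`, and `dropIgnored` are hypothetical names invented for illustration, and the `ignoreIndexUpdate` field only mirrors the idea of the flag mentioned above. The first method merges records that share a key and prefers the non-deleted one, as the diff comment describes; the second drops flagged records up front, so no merge preference is needed for the partition path update case.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class SecondaryIndexDedupSketch {

  /** Hypothetical stand-in for a secondary index record; not a Hudi class. */
  static final class IndexRecord {
    final String key;                // secondaryKey$recordKey
    final String fileId;             // file group the record points to
    final boolean deleted;           // true for a delete (tombstone) record
    final boolean ignoreIndexUpdate; // assumed flag: skip this record for index updates

    IndexRecord(String key, String fileId, boolean deleted, boolean ignoreIndexUpdate) {
      this.key = key;
      this.fileId = fileId;
      this.deleted = deleted;
      this.ignoreIndexUpdate = ignoreIndexUpdate;
    }
  }

  /** Option A: group by key and, on a collision, prefer the non-deleted record. */
  static Map<String, IndexRecord> reducePreferNonDeleted(List<IndexRecord> records) {
    Map<String, IndexRecord> byKey = new LinkedHashMap<>();
    for (IndexRecord r : records) {
      // Keep the existing record unless it is a delete; then take the newcomer.
      byKey.merge(r.key, r, (existing, incoming) -> existing.deleted ? incoming : existing);
    }
    return byKey;
  }

  /** Option B: drop records flagged to be ignored before any merge happens. */
  static List<IndexRecord> dropIgnored(List<IndexRecord> records) {
    List<IndexRecord> kept = new ArrayList<>();
    for (IndexRecord r : records) {
      if (!r.ignoreIndexUpdate) {
        kept.add(r);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // A record moves from file group fg-1 to fg-2: one delete plus one insert.
    List<IndexRecord> records = Arrays.asList(
        new IndexRecord("city$k1", "fg-1", true, true),
        new IndexRecord("city$k1", "fg-2", false, false));

    System.out.println(reducePreferNonDeleted(records).get("city$k1").fileId);
    System.out.println(dropIgnored(records).get(0).fileId);
  }
}
```

In this sketch both options keep the insert that points at the new file group. The difference is that option B never has to compare a delete against an insert for the same key, which is the determinism argument made above.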



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
