Re: [PR] perf: optimize removeCommitMetadata method in HoodieCDCLogger [hudi]

via GitHub Mon, 22 Dec 2025 19:08:34 -0800


voonhous commented on PR #17669:
URL: https://github.com/apache/hudi/pull/17669#issuecomment-3684906736


   @yihua This PR should be different from the _O(1)_ comparison 
https://github.com/apache/hudi/pull/17672, which only affects workflows that 
involve `HoodieMetadataPayload`.
   
   For this PR, the main optimization is avoiding the costly recursive 
schema checks. To be specific, the performance improvement comes from 
replacing the highly generic, recursive, and safety-heavy utility method with a 
specialized, flat, and shallow implementation.
   
   ## Less recursion:
   
   **Old Way:**
   It uses a recursive switch statement. For every field, it calls 
`rewriteRecordWithNewSchema` again. If you have a record with 20 fields, that’s 
20+ method calls, stack pushes/pops, and type checks.
   
   **New Way:**
   The new `getRecordWithoutMetadata` performs a single, flat for loop over the 
top-level fields of the schema. Since CDC data stripping usually only happens 
at the top level, this avoids the overhead of traversing the entire object tree.
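
   To illustrate the difference in traversal, here is a minimal, self-contained sketch. It does not use Avro or Hudi: `Rec`, `rewriteRecursive`, `copyTopLevel`, and `sample` are hypothetical stand-ins, and the only detail borrowed from Hudi is that metadata field names start with `_hoodie_`. The recursive variant visits every value in the tree, while the flat variant loops once over the top-level fields and reuses nested values by reference.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlatCopySketch {
    // Hypothetical stand-in for an Avro record: parallel field names and values.
    static final class Rec {
        final List<String> names;
        final Object[] values;

        Rec(List<String> names, Object[] values) {
            this.names = names;
            this.values = values;
        }
    }

    // Counts how many values the recursive variant visits.
    static int visits = 0;

    // Generic style: drops prefixed fields and descends into every nested record.
    static Object rewriteRecursive(Object value, String dropPrefix) {
        visits++;
        if (!(value instanceof Rec)) {
            return value; // leaf value, nothing to rewrite
        }
        Rec in = (Rec) value;
        List<String> names = new ArrayList<>();
        List<Object> values = new ArrayList<>();
        for (int i = 0; i < in.names.size(); i++) {
            if (in.names.get(i).startsWith(dropPrefix)) continue;
            names.add(in.names.get(i));
            values.add(rewriteRecursive(in.values[i], dropPrefix));
        }
        return new Rec(names, values.toArray());
    }

    // Specialized style: one loop over top-level fields, nested values reused by reference.
    static Rec copyTopLevel(Rec in, String dropPrefix) {
        List<String> names = new ArrayList<>();
        List<Object> values = new ArrayList<>();
        for (int i = 0; i < in.names.size(); i++) {
            if (in.names.get(i).startsWith(dropPrefix)) continue;
            names.add(in.names.get(i));
            values.add(in.values[i]);
        }
        return new Rec(names, values.toArray());
    }

    // Illustrative record: one metadata field, one scalar, one two-level nested record.
    static Rec sample() {
        Rec geo = new Rec(Arrays.asList("lat", "lon"), new Object[] {1.0, 2.0});
        Rec address = new Rec(Arrays.asList("city", "geo"), new Object[] {"x", geo});
        return new Rec(
            Arrays.asList("_hoodie_commit_time", "id", "address"),
            new Object[] {"t1", 7, address});
    }

    public static void main(String[] args) {
        Rec top = sample();
        Rec deep = (Rec) rewriteRecursive(top, "_hoodie_");
        Rec flat = copyTopLevel(top, "_hoodie_");
        System.out.println("recursive visits: " + visits);
        System.out.println("same fields kept: " + deep.names.equals(flat.names));
    }
}
```

   Both variants keep the same top-level fields here, but the recursive one rebuilds the nested `address` record while the flat one hands back the original object.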
   
   
   ## I believe it also **alleviates GC pressure**:
   
   **Old Way:** 
   
   The utility creates several helper objects for every record processed:
   - A `Deque<String> fieldNames` to track the breadcrumb path (mainly for 
error reporting IIUC).
   - `Map<String, String> renameCols` (even if empty).
   - Multiple string concatenations for `createNamePrefix` and `createFullName`
   
   **New Way:**
   - It creates exactly one `GenericData.Record`. No collections, no iterators, 
and no string builders are initialized per record.
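
   The allocation point can be sketched the same way. This is an assumption about one way to get there, not necessarily what the PR does: the hypothetical `Projection` class below computes the kept field positions once per schema, so the per-record `apply` call allocates a single output array (standing in for the one `GenericData.Record`) and no collections, iterators, or strings.

```java
import java.util.Arrays;
import java.util.List;

public class ProjectionSketch {
    // Hypothetical projection, built once per schema rather than once per record.
    static final class Projection {
        final String[] targetNames;
        final int[] sourcePositions;

        Projection(List<String> sourceNames, String dropPrefix) {
            int kept = 0;
            for (String n : sourceNames) {
                if (!n.startsWith(dropPrefix)) kept++;
            }
            targetNames = new String[kept];
            sourcePositions = new int[kept];
            int t = 0;
            for (int i = 0; i < sourceNames.size(); i++) {
                String n = sourceNames.get(i);
                if (!n.startsWith(dropPrefix)) {
                    targetNames[t] = n;
                    sourcePositions[t] = i;
                    t++;
                }
            }
        }

        // Per-record work: exactly one array allocation, then a positional copy.
        Object[] apply(Object[] sourceValues) {
            Object[] out = new Object[sourcePositions.length];
            for (int i = 0; i < out.length; i++) {
                out[i] = sourceValues[sourcePositions[i]];
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Projection p = new Projection(
            Arrays.asList("_hoodie_commit_time", "_hoodie_record_key", "id", "name"),
            "_hoodie_");
        Object[] out = p.apply(new Object[] {"t1", "k1", 7, "a"});
        System.out.println(Arrays.toString(p.targetNames) + " -> " + Arrays.toString(out));
    }
}
```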
   
   
   CMIIW @kamronis, but the gain from this optimization should be largest 
for records that are deeply nested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
