voonhous commented on PR #17669: URL: https://github.com/apache/hudi/pull/17669#issuecomment-3684906736
@yihua This PR should be different from the _O(1)_ comparison in https://github.com/apache/hudi/pull/17672, which only affects workflows that involve `HoodieMetadataPayload`.

For this PR, the main optimization is avoiding the costly recursive schema checks. Specifically, the performance improvement comes from replacing the highly generic, recursive, and safety-heavy utility method with a specialized, flat, and shallow implementation.

## Less recursion:

**Old way:** It uses a recursive switch statement. For every field, it calls `rewriteRecordWithNewSchema` again. If you have a record with 20 fields, that's 20+ method calls, stack pushes/pops, and type checks.

**New way:** The new `getRecordWithoutMetadata` performs a single, flat for loop over the top-level fields of the schema. Since CDC data stripping usually only happens at the top level, this avoids the overhead of traversing the entire object tree.

## I believe it also **alleviates GC pressure**:

**Old way:** The utility creates several helper objects for every record processed:
- A `Deque<String> fieldNames` to track the breadcrumb path (mainly for error reporting IIUC).
- `Map<String, String> renameCols` (even if empty).
- Multiple string concatenations for `createNamePrefix` and `createFullName`.

**New way:**
- It creates exactly one `GenericData.Record`. No collections, no iterators, and no string builders are initialized per record.

CMIIW @kamronis, the performance optimization here should show the largest gains for records that are deeply nested.
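To make the recursive-versus-flat contrast concrete, here is a minimal, self-contained sketch. It is not the code from this PR: plain `Map`s stand in for Avro records so it runs without dependencies, and the names `FlatVsRecursiveStrip`, `rewriteRecursive`, `stripTopLevel`, and `META_PREFIX` are all hypothetical. It only models the two shapes of work described above: a generic rewrite that recurses into every value and tracks a breadcrumb path, versus a single shallow loop over top-level fields.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: plain Maps stand in for Avro records, and every name
// in this class is hypothetical rather than part of Hudi's API.
final class FlatVsRecursiveStrip {
    // Assumed prefix for metadata columns in this sketch.
    static final String META_PREFIX = "_hoodie_";

    // Counts invocations of the recursive path so its overhead is visible.
    static int recursiveCalls = 0;

    // "Old way" sketch: one call per value, a breadcrumb deque that is
    // pushed and popped per field, and a new container per nested record.
    @SuppressWarnings("unchecked")
    static Object rewriteRecursive(Object value, Deque<String> path) {
        recursiveCalls++;
        if (!(value instanceof Map)) {
            return value; // leaf value: nothing to rebuild
        }
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : ((Map<String, Object>) value).entrySet()) {
            // Metadata columns are dropped only at the top level.
            if (path.isEmpty() && e.getKey().startsWith(META_PREFIX)) {
                continue;
            }
            path.addLast(e.getKey());
            out.put(e.getKey(), rewriteRecursive(e.getValue(), path));
            path.removeLast();
        }
        return out;
    }

    // "New way" sketch: a single flat loop over top-level fields. Nested
    // values are carried over by reference instead of being traversed.
    static Map<String, Object> stripTopLevel(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : record.entrySet()) {
            if (!e.getKey().startsWith(META_PREFIX)) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Singapore");
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("_hoodie_commit_time", "20250101");
        record.put("id", 1);
        record.put("address", address);

        Object viaRecursion = rewriteRecursive(record, new ArrayDeque<>());
        Map<String, Object> viaFlatLoop = stripTopLevel(record);

        System.out.println("recursive calls: " + recursiveCalls);
        System.out.println("same content: " + viaRecursion.equals(viaFlatLoop));
    }
}
```

Both paths produce the same content, but the recursive one pays one call per value plus deque traffic, and that cost grows with nesting depth. The flat loop touches only the top-level fields no matter how deep the record is, which is consistent with the expectation that deeply nested records benefit the most.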
