prashantwason opened a new pull request, #17774: URL: https://github.com/apache/hudi/pull/17774
### Describe the issue this Pull Request addresses

This PR optimizes the bloom index key lookup by storing candidate record keys in a `Set` instead of an `ArrayList`. An `ArrayList` incurs significant memory overhead when it grows beyond its initially allocated capacity, because its backing array must be reallocated and copied. A `Set` is better suited to existence checks and avoids the need to copy the collection when calling `filterRowKeys()`.

### Summary and Changelog

- Changed `candidateRecordKeys` in `HoodieKeyLookupHandle` from `ArrayList<String>` to `HashSet<String>`
- Updated the `filterKeysFromFile` method signature in `HoodieIndexUtils` to accept `Set<String>` instead of `List<String>`
- Removed the unnecessary `.stream().collect(Collectors.toSet())` call, since the input is already a `Set`

### Impact

No public API changes. This is an internal optimization that reduces memory overhead when looking up a large number of keys during bloom index operations.

### Risk Level

low - This is a straightforward type change from `ArrayList` to `HashSet` with no behavioral changes. `Set` semantics are also more appropriate here, since the code checks for key existence.

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
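
The difference between the two approaches can be illustrated with a minimal, self-contained Java sketch. It is not the Hudi code: the class `CandidateKeyLookupSketch` and the methods `filterCandidates`, `lookupWithList`, and `lookupWithSet` are hypothetical names invented for illustration, and the file's keys are modelled as a plain in-memory `Set<String>`. The sketch only assumes, as described above, that the downstream filter expects a `Set`, so a `List` of candidates has to be copied first while a `Set` can be passed through directly.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class CandidateKeyLookupSketch {

  // Hypothetical stand-in for a filter that takes a Set of candidate keys
  // and returns those that are actually present in a file.
  static Set<String> filterCandidates(Set<String> candidateKeys, Set<String> keysInFile) {
    Set<String> found = new HashSet<>();
    for (String key : candidateKeys) {
      if (keysInFile.contains(key)) {
        found.add(key);
      }
    }
    return found;
  }

  // "Before" shape: candidates are accumulated in a List, so an extra Set
  // copy must be built before the filter can be called.
  static Set<String> lookupWithList(List<String> candidateKeys, Set<String> keysInFile) {
    Set<String> copy = candidateKeys.stream().collect(Collectors.toSet());
    return filterCandidates(copy, keysInFile);
  }

  // "After" shape: candidates are accumulated in a Set from the start,
  // so they are handed to the filter without an intermediate copy.
  static Set<String> lookupWithSet(Set<String> candidateKeys, Set<String> keysInFile) {
    return filterCandidates(candidateKeys, keysInFile);
  }

  public static void main(String[] args) {
    Set<String> keysInFile = Set.of("key1", "key3");

    List<String> asList = new ArrayList<>(List.of("key1", "key2", "key3"));
    Set<String> asSet = new HashSet<>(asList);

    // Both paths return the same matches; only the intermediate copy differs.
    System.out.println(lookupWithList(asList, keysInFile).equals(lookupWithSet(asSet, keysInFile)));
  }
}
```

One behavioral detail to keep in mind with this kind of change: a `Set` silently drops duplicate keys as they are added, whereas a `List` retains them until the conversion. The result of the lookup is the same in both cases, because the list-based path deduplicates during the copy.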
