[PR] perf(index): Use Set instead of ArrayList to reduce memory overhead in key lookup [hudi]

via GitHub Sat, 03 Jan 2026 22:02:51 -0800


prashantwason opened a new pull request, #17774:
URL: https://github.com/apache/hudi/pull/17774


   ### Describe the issue this Pull Request addresses
   
   This PR optimizes the bloom index key lookup by storing candidate record 
keys in a `Set` instead of an `ArrayList`. An `ArrayList` incurs extra memory 
overhead whenever it grows beyond its allocated capacity, because the backing 
array must be reallocated and copied. A `Set` is better suited to existence 
checks and avoids copying the collection when calling `filterRowKeys()`.
   
   ### Summary and Changelog
   
   - Changed `candidateRecordKeys` in `HoodieKeyLookupHandle` from 
`ArrayList<String>` to `HashSet<String>`
   - Updated `filterKeysFromFile` method signature in `HoodieIndexUtils` to 
accept `Set<String>` instead of `List<String>`
   - Removed unnecessary `.stream().collect(Collectors.toSet())` call since the 
input is already a Set
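
   The change listed above can be sketched roughly as follows. This is a 
minimal illustration, not the actual Hudi code: `KeyLookupSketch`, 
`filterWithList`, `filterWithSet`, and `keepMatches` are hypothetical names, 
and the keys in the file are modeled as a plain `List<String>`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for the key lookup path; not the real Hudi classes.
public class KeyLookupSketch {

  // Before: candidates arrive as a List and are copied into a Set
  // before any membership test can run efficiently.
  static Set<String> filterWithList(List<String> candidateKeys, List<String> keysInFile) {
    Set<String> candidateSet = candidateKeys.stream().collect(Collectors.toSet()); // extra copy
    return keepMatches(candidateSet, keysInFile);
  }

  // After: candidates are already a Set, so no intermediate copy is needed.
  static Set<String> filterWithSet(Set<String> candidateKeys, List<String> keysInFile) {
    return keepMatches(candidateKeys, keysInFile);
  }

  // Returns the keys from the file that are also present in the candidate set.
  static Set<String> keepMatches(Set<String> candidates, List<String> keysInFile) {
    Set<String> found = new HashSet<>();
    for (String key : keysInFile) {
      if (candidates.contains(key)) {
        found.add(key);
      }
    }
    return found;
  }

  public static void main(String[] args) {
    List<String> keysInFile = Arrays.asList("k1", "k2", "k3");
    List<String> asList = new ArrayList<>(Arrays.asList("k2", "k3", "k9"));
    Set<String> asSet = new HashSet<>(asList);
    // Both paths select the same keys; only the intermediate copy differs.
    System.out.println(filterWithList(asList, keysInFile).equals(filterWithSet(asSet, keysInFile)));
  }
}
```

   One difference worth keeping in mind: a `HashSet` drops duplicate keys and 
does not preserve insertion order, whereas an `ArrayList` keeps both. For a 
pure existence check, as in this sketch, that does not affect the result.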
   
   ### Impact
   
   No public API changes. This is an internal optimization that reduces memory 
overhead when looking up a large number of keys during bloom index operations.
   
   ### Risk Level
   
   low - This is a straightforward type change from ArrayList to HashSet with 
no behavioral changes. The Set semantics are actually more appropriate since 
we're checking for key existence.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable


