zhan7236 opened a new pull request, #2687:
URL: https://github.com/apache/uniffle/pull/2687
### What changes were proposed in this pull request?
Replace `HashSet<Long>` with `Roaring64NavigableMap` in
`RssShuffleWriter#checkSentBlockCount` method for both Spark2 and Spark3
clients. This optimization uses a compressed bitmap data structure to filter
duplicate blockIds from multiple replicas.
Changes:
- Added import for `org.roaringbitmap.longlong.Roaring64NavigableMap`
- Replaced `Set<Long> blockIds = new HashSet<>()` with
`Roaring64NavigableMap blockIdBitmap = Roaring64NavigableMap.bitmapOf()`
- Changed `blockIds.addAll(x)` to `x.forEach(blockIdBitmap::addLong)`
- Changed `blockIds.size()` to `blockIdBitmap.getLongCardinality()`
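
The replacements listed above can be pictured with a minimal, stdlib-only sketch of the deduplication step. It is not the actual body of `checkSentBlockCount`: the class name, the method names, and the comparison against an expected count are assumptions made for illustration. The code shows the `HashSet<Long>` baseline, and the comments mark where each `Roaring64NavigableMap` call from this PR would go.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of the Uniffle code base.
class SentBlockCountSketch {

    // Counts distinct block ids reported across all replicas.
    static long distinctBlockCount(List<List<Long>> blockIdsPerReplica) {
        // Baseline. Per this PR: Roaring64NavigableMap.bitmapOf()
        Set<Long> blockIds = new HashSet<>();
        for (List<Long> replica : blockIdsPerReplica) {
            // Per this PR: replica.forEach(blockIdBitmap::addLong)
            blockIds.addAll(replica);
        }
        // Per this PR: blockIdBitmap.getLongCardinality()
        return blockIds.size();
    }

    // Assumed shape of the check: distinct ids must match the expected count.
    static boolean allBlocksSent(long expectedBlockCount,
                                 List<List<Long>> blockIdsPerReplica) {
        return distinctBlockCount(blockIdsPerReplica) == expectedBlockCount;
    }

    // Two replicas that both received blocks 2 and 3.
    static List<List<Long>> sampleReplicas() {
        return Arrays.asList(
            Arrays.asList(1L, 2L, 3L),
            Arrays.asList(2L, 3L, 4L));
    }

    public static void main(String[] args) {
        System.out.println(distinctBlockCount(sampleReplicas())); // prints 4
        System.out.println(allBlocksSent(4, sampleReplicas()));   // prints true
    }
}
```

Because only the distinct count is read back, switching the container does not change the result; it changes how the ids are held in memory.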
### Why are the changes needed?
`Roaring64NavigableMap` is a compressed bitmap that uses less memory than
`HashSet<Long>`, which keeps a boxed `Long` and a hash-table node for every
element. The saving is largest when storing large numbers of blockIds, which
are typically consecutive or near-consecutive long integers and therefore
compress well. This can significantly reduce memory usage in large-scale
shuffle scenarios.
Fix: #2675
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Compiled successfully with both `spark2` and `spark3` profiles
- All existing unit tests pass:
  - `RssShuffleWriterTest` for Spark3: 7 tests passed
  - `RssShuffleWriterTest` for Spark2: 3 tests passed
- Code style verified with `mvn spotless:check`