JacobZheng0927 opened a new pull request, #52210: URL: https://github.com/apache/spark/pull/52210
### What changes were proposed in this pull request?

This PR optimizes BlockManager remove operations by introducing cached mappings to eliminate O(n) linear scans. The main changes are:

1. **Introduced three concurrent hash maps** to track block ID associations:
   - `rddToBlockIds`: maps an RDD ID to its block IDs
   - `broadcastToBlockIds`: maps a broadcast ID to its block IDs
   - `sessionToBlockIds`: maps a session UUID to its cache block IDs
2. **Added cache maintenance methods**:
   - `addToCache(blockId)`: updates the caches when a block is stored
   - `removeFromCache(blockId)`: updates the caches when a block is deleted
3. **Reworked remove operations** to use cached lookups:
   - `removeRdd()`, `removeBroadcast()`, and `removeCache()` now perform O(1) lookups instead of scanning all entries
4. **Integrated with the block lifecycle**:
   - `doPutIterator()` calls `addToCache()` after a block is stored successfully
   - `removeBlock()` calls `removeFromCache()` when a block is removed

### Why are the changes needed?

Previously, `removeRdd()`, `removeBroadcast()`, and `removeCache()` had to scan every block in `blockInfoManager.entries` to find matches. This becomes a serious bottleneck when:

1. **Block counts are large**: in production deployments with millions or even tens of millions of cached blocks, linear scans can be prohibitively slow.
2. **Cleanup is frequent**: workloads that repeatedly create and discard RDDs or broadcast variables accumulate this overhead quickly.

The original `removeRdd()` already carried a TODO noting that an additional mapping would be needed to avoid linear scans; this PR implements that improvement.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

- **Unit tests**: verified the correctness of `removeRdd()`, `removeBroadcast()`, and `removeCache()`, including edge cases.
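To illustrate the approach, here is a minimal, self-contained Scala sketch of the RDD-side mapping. The names `rddToBlockIds`, `addToCache`, `removeFromCache`, and `removeRdd` come from the PR description; everything else (the simplified `BlockId` hierarchy, the `BlockIndexSketch` object, and the exact map types) is assumed for illustration and does not reflect the actual patch:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Hypothetical stand-ins for Spark's BlockId hierarchy, used only
// to keep this sketch self-contained.
sealed trait BlockId
case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId

object BlockIndexSketch {
  // rddId -> set of block IDs currently stored for that RDD.
  // A concurrent map of concurrent key sets keeps updates thread-safe.
  private val rddToBlockIds =
    new ConcurrentHashMap[Int, ConcurrentHashMap.KeySetView[BlockId, java.lang.Boolean]]()

  // Called when a block is successfully stored (the PR hooks this
  // into doPutIterator()).
  def addToCache(blockId: BlockId): Unit = blockId match {
    case id @ RDDBlockId(rddId, _) =>
      rddToBlockIds
        .computeIfAbsent(rddId, _ => ConcurrentHashMap.newKeySet[BlockId]())
        .add(id)
    case _ => // broadcast / session blocks would be tracked the same way
  }

  // Called when a block is removed (the PR hooks this into removeBlock()).
  def removeFromCache(blockId: BlockId): Unit = blockId match {
    case id @ RDDBlockId(rddId, _) =>
      Option(rddToBlockIds.get(rddId)).foreach { set =>
        set.remove(id)
        if (set.isEmpty) rddToBlockIds.remove(rddId, set)
      }
    case _ =>
  }

  // O(1) lookup of the blocks belonging to an RDD, instead of scanning
  // every entry in blockInfoManager.entries.
  def removeRdd(rddId: Int): Seq[BlockId] = {
    val blocks = Option(rddToBlockIds.remove(rddId))
      .map(_.asScala.toSeq)
      .getOrElse(Seq.empty)
    // blocks.foreach(removeBlock)  // the actual per-block deletion is elided
    blocks
  }
}
```

The key design point is that the index is maintained incrementally on every put and remove, so cleanup cost becomes proportional to the number of blocks actually being dropped rather than to the total number of cached blocks.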
- **Stress tests**: ran many simple tasks using broadcast joins under sustained high concurrency to validate the performance and stability of the optimized remove operations.

*Before optimization*

<img width="822" height="305" alt="image" src="https://github.com/user-attachments/assets/4a619eff-393a-41e9-9610-712258956893" />

*After optimization*

<img width="813" height="296" alt="image" src="https://github.com/user-attachments/assets/add6a6e6-df22-4ca5-972b-fe89e3c22332" />

The optimization delivers a significant speedup for block cleanup under large data volumes and reduces the GC pressure caused by accumulated blocks.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org