JacobZheng0927 opened a new pull request, #52210:
URL: https://github.com/apache/spark/pull/52210

   ### What changes were proposed in this pull request?
   
   This PR optimizes BlockManager remove operations by introducing cached 
mappings to eliminate O(n) linear scans. The main changes are:
   
   1. **Introduced three concurrent hash maps** to track block ID associations:
      - `rddToBlockIds`: Maps RDD ID to its block IDs  
      - `broadcastToBlockIds`: Maps broadcast ID to its block IDs  
      - `sessionToBlockIds`: Maps session UUID to its cached block IDs
   
   2. **Added cache maintenance methods**:
      - `addToCache(blockId)`: Updates caches when blocks are stored  
      - `removeFromCache(blockId)`: Updates caches when blocks are deleted  
   
   3. **Reworked remove operations** to use cached lookups:
      - `removeRdd()`, `removeBroadcast()`, and `removeCache()` now resolve their target blocks through O(1) map lookups instead of scanning all entries (see the sketch after this list)  
   
   4. **Integrated with block lifecycle**:
      - `doPutIterator()` calls `addToCache()` after successful block storage  
      - `removeBlock()` calls `removeFromCache()` when blocks are removed  
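
   A minimal, self-contained sketch of the bookkeeping described above. The class name, block-id types, and helper names here are simplified stand-ins for illustration, not the actual `BlockManager` internals:

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.jdk.CollectionConverters._

// Simplified stand-ins for Spark's block id types (illustration only).
sealed trait BlockId
case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId
case class BroadcastBlockId(broadcastId: Long, field: String = "") extends BlockId
case class CacheBlockId(sessionUUID: String, hash: String) extends BlockId

class BlockIndex {
  // Reverse indices: owning entity -> the block ids that belong to it.
  private val rddToBlockIds = new ConcurrentHashMap[Int, java.util.Set[BlockId]]()
  private val broadcastToBlockIds = new ConcurrentHashMap[Long, java.util.Set[BlockId]]()
  private val sessionToBlockIds = new ConcurrentHashMap[String, java.util.Set[BlockId]]()

  private def newSet(): java.util.Set[BlockId] = ConcurrentHashMap.newKeySet[BlockId]()

  // Called after a block has been successfully stored (e.g. from doPutIterator()).
  def addToCache(blockId: BlockId): Unit = blockId match {
    case id: RDDBlockId =>
      rddToBlockIds.computeIfAbsent(id.rddId, _ => newSet()).add(id)
    case id: BroadcastBlockId =>
      broadcastToBlockIds.computeIfAbsent(id.broadcastId, _ => newSet()).add(id)
    case id: CacheBlockId =>
      sessionToBlockIds.computeIfAbsent(id.sessionUUID, _ => newSet()).add(id)
    case _ => // other block kinds are not indexed
  }

  // Called when a block is removed (e.g. from removeBlock()), keeping the indices consistent.
  def removeFromCache(blockId: BlockId): Unit = blockId match {
    case id: RDDBlockId       => dropFrom(rddToBlockIds, id.rddId, id)
    case id: BroadcastBlockId => dropFrom(broadcastToBlockIds, id.broadcastId, id)
    case id: CacheBlockId     => dropFrom(sessionToBlockIds, id.sessionUUID, id)
    case _ =>
  }

  // Remove one block id from an index entry, dropping the entry once it becomes empty.
  private def dropFrom[K](
      map: ConcurrentHashMap[K, java.util.Set[BlockId]], key: K, id: BlockId): Unit = {
    map.computeIfPresent(key, (_, blocks) => {
      blocks.remove(id)
      if (blocks.isEmpty) null else blocks
    })
  }

  // removeRdd() can now look up the affected blocks directly instead of scanning every entry.
  def blocksForRdd(rddId: Int): Seq[BlockId] =
    Option(rddToBlockIds.get(rddId)).map(_.asScala.toSeq).getOrElse(Seq.empty)
}
```

   In this sketch the per-entity sets come from `ConcurrentHashMap.newKeySet`, so both the outer maps and the inner sets tolerate concurrent updates, and an entry is dropped once its set is empty so keys for fully removed RDDs, broadcasts, or sessions do not leak.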
     
   ### Why are the changes needed?
   
   Previously, `removeRdd()`, `removeBroadcast()`, and `removeCache()` required 
scanning all blocks in `blockInfoManager.entries` to find matches. This 
approach becomes a serious bottleneck when:
   
   1. **Large block counts**: In production deployments with millions or even 
tens of millions of cached blocks, linear scans can be prohibitively slow  
   2. **High cleanup frequency**: Workloads that repeatedly create and discard 
RDDs or broadcast variables accumulate overhead quickly
   
   The original `removeRdd()` method already contained a TODO noting that an 
additional mapping would be needed to avoid linear scans. This PR implements 
that improvement.
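
   For context on the cost difference, here is a tiny, self-contained contrast of the two lookup strategies (stand-in types and function names; not the actual `BlockManager` code):

```scala
final case class RDDBlockId(rddId: Int, splitIndex: Int)

// Old pattern: visit every tracked block id just to find one RDD's blocks, O(total blocks).
def blocksByScan(allBlocks: Iterable[RDDBlockId], rddId: Int): Seq[RDDBlockId] =
  allBlocks.filter(_.rddId == rddId).toSeq

// New pattern: consult a maintained reverse index. The map lookup is O(1), and only the
// blocks that actually belong to the RDD are touched afterwards.
def blocksByIndex(rddToBlockIds: Map[Int, Seq[RDDBlockId]], rddId: Int): Seq[RDDBlockId] =
  rddToBlockIds.getOrElse(rddId, Seq.empty)
```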
   
   ### Does this PR introduce any user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   - **Unit tests**: Verified the correctness of `removeRdd()`, 
`removeBroadcast()`, and `removeCache()`, including edge cases.  
   - **Stress tests**: Ran multiple simple tasks using broadcast joins under 
sustained high concurrency to validate performance and stability of the 
optimized remove operations.
   
   
   *Before optimization*
   <img width="822" height="305" alt="image" 
src="https://github.com/user-attachments/assets/4a619eff-393a-41e9-9610-712258956893";
 />
   
   *After optimization*
   
   <img width="813" height="296" alt="image" 
src="https://github.com/user-attachments/assets/add6a6e6-df22-4ca5-972b-fe89e3c22332";
 />
   
   The optimization delivers a significant speedup for block cleanup under large data volumes, reducing the overhead incurred when frequent GC-triggered cleanups run against a large number of accumulated blocks.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.



