sunchao opened a new pull request, #56293:
URL: https://github.com/apache/spark/pull/56293

   ### Why are the changes needed?
   
   `TaskMemoryManager.allocatePage()` first acquires execution memory and then 
asks the Tungsten allocator to create the physical page. If the physical 
allocator throws `OutOfMemoryError`, Spark currently retains the grant and 
calls `allocatePage()` recursively.
   
   Repeated physical allocation failures can therefore accumulate 
acquired-but-unused grants and retry indefinitely instead of recovering or 
failing promptly. This is the generic allocator failure mechanism discussed in 
SPARK-54354; SPARK-54818 improved its diagnostics but left the recursive retry 
behavior unchanged.
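
   To make the accumulation concrete, here is a toy model of the behavior described above. It is not Spark's `TaskMemoryManager`: the class, the `granted` counter, and the simulated allocator are hypothetical names invented for illustration, and the model assumes every recursive call acquires a fresh grant of the same size.

```java
/** Toy model only; not Spark code. Every name in this class is hypothetical. */
final class RecursiveRetryLeakSketch {
    /** Bytes granted so far by the simulated execution-memory pool. */
    static long granted = 0;
    /** Number of physical allocations that will still fail before one succeeds. */
    static int failuresLeft = 0;

    /** Simulated physical allocator: throws OutOfMemoryError while failuresLeft > 0. */
    static long[] physicalAllocate(int words) {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new OutOfMemoryError("simulated allocator failure");
        }
        return new long[words];
    }

    /** Acquire a grant, try the allocator, and on failure recurse with the grant retained. */
    static long[] allocateRecursively(int words) {
        granted += words * 8L;                  // execution memory is acquired first
        try {
            return physicalAllocate(words);
        } catch (OutOfMemoryError e) {
            return allocateRecursively(words);  // the earlier grant is kept, not reused
        }
    }

    /** Runs the model and returns the bytes that were granted but never backed by a page. */
    static long unusedGrantAfter(int failures, int words) {
        granted = 0;
        failuresLeft = failures;
        long[] page = allocateRecursively(words);
        return granted - page.length * 8L;
    }

    public static void main(String[] args) {
        // Three simulated failures for a 1024-byte page.
        System.out.println(unusedGrantAfter(3, 128));
    }
}
```

   In this model, three failed attempts for a 1024-byte page leave 3072 bytes granted but unused, and a permanently failing allocator would recurse without bound.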
   
   ### What changes were proposed in this PR?
   
   - Replace recursive page allocation with an iterative retry against the 
existing execution-memory grant.
   - Spill task-managed consumers directly after allocator OOM and retry only 
while tracked consumer memory decreases.
   - Return allocation failure when no consumer can make progress, allowing 
callers to raise `SparkOutOfMemoryError`.
   - Prevent nested page allocations from recursively entering allocator 
recovery.
   - Make failed-grant cleanup idempotent.
   - Allocate `ShuffleInMemorySorter` replacement arrays lazily after spilling 
and restore/recheck the pointer array when page allocation triggers a spill.
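
   The first three bullets and the idempotent cleanup can be sketched roughly as follows. This is a simplified model under stated assumptions, not the code in this PR: `Consumer`, `Grant`, `acquire`, `release`, and the capacity-based simulated allocator are all hypothetical, and returning `null` stands in for "return allocation failure".

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; not Spark code. Every name in this class is hypothetical. */
final class IterativeRetrySketch {
    /** A task-managed consumer that can give back some of the memory it holds. */
    static final class Consumer {
        long used;
        Consumer(long used) { this.used = used; }
        /** Frees up to `wanted` bytes and returns the amount actually freed. */
        long spill(long wanted) {
            long freed = Math.min(used, wanted);
            used -= freed;
            return freed;
        }
    }

    /** An execution-memory grant whose release is safe to call more than once. */
    static final class Grant {
        final long size;
        boolean released = false;
        Grant(long size) { this.size = size; }
    }

    final List<Consumer> consumers = new ArrayList<>();
    final long physicalCapacity;  // total bytes the simulated allocator can back
    long granted = 0;             // bytes of execution memory currently granted

    IterativeRetrySketch(long physicalCapacity) { this.physicalCapacity = physicalCapacity; }

    Grant acquire(long size) { granted += size; return new Grant(size); }

    /** Idempotent cleanup: a second call for the same grant changes nothing. */
    void release(Grant g) {
        if (g.released) return;
        g.released = true;
        granted -= g.size;
    }

    long trackedUse() {
        long total = 0;
        for (Consumer c : consumers) total += c.used;
        return total;
    }

    /** Simulated allocator: fails while consumers plus the new page exceed capacity. */
    long[] physicalAllocate(long size) {
        if (trackedUse() + size > physicalCapacity) {
            throw new OutOfMemoryError("simulated allocator failure");
        }
        return new long[(int) (size / 8)];
    }

    /** Returns the page, or null when spilling can no longer make progress. */
    long[] allocatePage(long size) {
        Grant grant = acquire(size);           // one grant, reused across all retries
        while (true) {
            try {
                return physicalAllocate(size);
            } catch (OutOfMemoryError e) {
                long before = trackedUse();
                for (Consumer c : consumers) {
                    if (c.spill(size) > 0) break;  // spill one consumer, then retry
                }
                if (trackedUse() >= before) {  // nothing was freed: stop retrying
                    release(grant);
                    return null;
                }
            }
        }
    }

    public static void main(String[] args) {
        IterativeRetrySketch m = new IterativeRetrySketch(1024);
        m.consumers.add(new Consumer(1024));
        System.out.println(m.allocatePage(512) != null);  // spill frees room for the retry
    }
}
```

   The loop terminates because each pass either succeeds, strictly reduces tracked consumer memory, or gives up and releases the grant.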
   
   ### How was this PR tested?
   
   - Added deterministic `TaskMemoryManagerSuite` coverage for allocator OOM 
without spillable memory, successful spill and retry using the same grant, 
nested page allocation during recovery, and off-heap allocator failure.
   - Added shuffle sorter coverage for lazy pointer-array reset, cleanup after 
reset, and data-page allocation triggering a spill.
   - `TaskMemoryManagerSuite`
   - `ShuffleInMemorySorterSuite`
   - `ShuffleExternalSorterSuite`
   - `UnsafeExternalSorterSuite` (26 JUnit tests passed during a broader core 
Java test run)
   - Core reactor compile and checkstyle
   - Scalafmt validation
   - `git diff --check`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
