sunchao opened a new pull request, #56293: URL: https://github.com/apache/spark/pull/56293
### Why are the changes needed?

`TaskMemoryManager.allocatePage()` first acquires execution memory and then asks the Tungsten allocator to create the physical page. If the physical allocator throws `OutOfMemoryError`, Spark currently retains the grant and recursively calls `allocatePage()` again. Repeated physical allocation failures can therefore accumulate acquired-but-unused grants and retry indefinitely instead of recovering or failing promptly.

This is the generic allocator failure mechanism discussed by SPARK-54354; SPARK-54818 improved its diagnostics but left the recursive retry behavior unchanged.

### What changes were proposed in this PR?

- Replace recursive page allocation with an iterative retry against the existing execution-memory grant.
- Spill task-managed consumers directly after allocator OOM and retry only while tracked consumer memory decreases.
- Return allocation failure when no consumer can make progress, allowing callers to raise `SparkOutOfMemoryError`.
- Prevent nested page allocations from recursively entering allocator recovery.
- Make failed-grant cleanup idempotent.
- Allocate `ShuffleInMemorySorter` replacement arrays lazily after spilling and restore/recheck the pointer array when page allocation triggers a spill.

### How was this PR tested?

- Added deterministic `TaskMemoryManagerSuite` coverage for allocator OOM without spillable memory, successful spill and retry using the same grant, nested page allocation during recovery, and off-heap allocator failure.
- Added shuffle sorter coverage for lazy pointer-array reset, cleanup after reset, and data-page allocation triggering a spill.
- `TaskMemoryManagerSuite`
- `ShuffleInMemorySorterSuite`
- `ShuffleExternalSorterSuite`
- `UnsafeExternalSorterSuite` (26 JUnit tests passed during a broader core Java test run)
- Core reactor compile and checkstyle
- Scalafmt validation
- `git diff --check`
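
For illustration only, the retry flow proposed above can be modeled roughly as follows. This is a simplified, hypothetical sketch and not the code in this PR: `PageAllocationSketch`, `Consumer`, `Allocator`, `grantedBytes`, and `spillMadeProgress` are invented names, the grant is reduced to a plain counter, and the real `TaskMemoryManager` locking, page tables, and `MemoryConsumer` API are omitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of iterative page allocation; not Spark's TaskMemoryManager. */
final class PageAllocationSketch {

  /** Stand-in for a task-managed consumer that can release memory by spilling. */
  interface Consumer {
    long usedBytes();
    void spill();
  }

  /** Stand-in for the physical allocator; may throw OutOfMemoryError. */
  interface Allocator {
    long[] allocate(int words);
  }

  final List<Consumer> consumers = new ArrayList<>();
  private final Allocator allocator;
  private boolean recovering = false; // true while spilling after an allocator OOM
  long grantedBytes = 0;              // simplified execution-memory accounting

  PageAllocationSketch(Allocator allocator) {
    this.allocator = allocator;
  }

  private long trackedBytes() {
    long total = 0;
    for (Consumer c : consumers) {
      total += c.usedBytes();
    }
    return total;
  }

  /** Returns a page, or null so the caller can raise its own out-of-memory error. */
  long[] allocatePage(int words) {
    long grant = 8L * words;
    grantedBytes += grant; // acquired once and reused across every retry
    while (true) {
      try {
        return allocator.allocate(words);
      } catch (OutOfMemoryError e) {
        // A nested allocation made during a spill must not start recovery again.
        if (recovering || !spillMadeProgress()) {
          grantedBytes -= grant; // release the unused grant exactly once
          return null;
        }
        // Tracked memory decreased, so retry with the same grant.
      }
    }
  }

  /** Spills consumers in order until tracked memory drops; false if none helps. */
  private boolean spillMadeProgress() {
    recovering = true;
    try {
      long before = trackedBytes();
      for (Consumer c : consumers) {
        c.spill();
        if (trackedBytes() < before) {
          return true;
        }
      }
      return false;
    } finally {
      recovering = false;
    }
  }
}
```

In this model the loop ends because each retry requires tracked consumer memory to strictly decrease, and a failed allocation leaves `grantedBytes` unchanged rather than accumulating one grant per attempt.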
