This is an automated email from the ASF dual-hosted git repository.

viirya pushed a commit to branch branch-4.x
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-4.x by this push:
     new df31a9708bcb [SPARK-56904][SQL] Fix Int overflow in LongToUnsafeRowMap page size computations
df31a9708bcb is described below

commit df31a9708bcbe458d0af5c87664ba536be283b7d
Author: Liang-Chi Hsieh <[email protected]>
AuthorDate: Sun May 17 15:00:41 2026 -0700

    [SPARK-56904][SQL] Fix Int overflow in LongToUnsafeRowMap page size computations
    
    ### What changes were proposed in this pull request?
    
    Fix three sites in `LongToUnsafeRowMap` where a `Long` page-word count is narrowed to `Int` and multiplied by 8 in `Int` arithmetic. At the upper bound (`1 << 30` long words, the explicit cap in `grow`, which equals the 8 GiB ceiling), `Int * 8` wraps to 0:
    
      - `LongToUnsafeRowMap.grow`: `val newPage = allocatePage(newNumWords.toInt * 8)`
      - `LongToUnsafeRowMap.read` (deserialization on executors): `page = allocatePage(pageLength * 8)`
      - `LongToUnsafeRowMap.read` (same method): `cursor = pageLength * 8 + page.getBaseOffset`
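
The wraparound can be checked in isolation. The following is a minimal sketch, not code from this patch: the object and method names are made up for illustration, and it only mirrors the arithmetic used at the three sites.

```scala
// Illustrative only: PageSizeOverflow and its methods are hypothetical names,
// not part of Spark. They reproduce the arithmetic, nothing else.
object PageSizeOverflow {
  // Old behavior: narrow the word count to Int, then multiply Int * Int.
  def bytesWithIntMath(numWords: Long): Int = numWords.toInt * 8

  // Fixed behavior: multiply by a Long literal so the product is a Long.
  def bytesWithLongMath(numWords: Long): Long = numWords * 8L

  def main(args: Array[String]): Unit = {
    val cap = 1L << 30 // word-count cap used in `grow`
    println(s"Int math at the cap:  ${bytesWithIntMath(cap)}")
    println(s"Long math at the cap: ${bytesWithLongMath(cap)}")
    // The Int product is already wrong below the cap: 2^28 words * 8 = 2^31,
    // which does not fit in a signed 32-bit Int.
    println(s"Int math at 1 << 28:  ${bytesWithIntMath(1L << 28)}")
  }
}
```

At `1 << 30` words the `Int` product is 0, while the `Long` product is the intended 8 GiB byte count.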
    
    When the multiplication overflows to 0, `MemoryConsumer.allocatePage(0)` falls through `TaskMemoryManager.allocatePage(Math.max(pageSize, 0))` and returns a default-sized page. Subsequent `append`s keep advancing `cursor` past the new page's end and `Platform.copyMemory(... page.getBaseObject, cursor, ...)` writes/reads into adjacent native pages, eventually crashing inside the SIMD-optimized `StubRoutines::forward_copy_longs` on aarch64 (SEGV_ACCERR at the over-read of the next mmap page).
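
The size mismatch described above can be sketched with a toy model. Everything here is an assumption for illustration: `DefaultPageBytes` is a hypothetical 1 MiB stand-in for the default page size, and `grantedBytes` only imitates the `Math.max(pageSize, 0)` fallback; neither is Spark's actual API, and no real memory is touched.

```scala
// Toy model only: names and the 1 MiB default are hypothetical.
object PageFallbackSketch {
  // Assumed default page size; the real value depends on configuration.
  val DefaultPageBytes: Long = 1L << 20

  // Imitates the fallback: the granted size is max(default, requested).
  def grantedBytes(requestedBytes: Long): Long =
    math.max(DefaultPageBytes, requestedBytes)

  // Number of bytes the map expects to own but was never granted.
  def overrunBytes(numWords: Long, useLongMath: Boolean): Long = {
    val requested: Long =
      if (useLongMath) numWords * 8L     // fixed: Long multiplication
      else (numWords.toInt * 8).toLong   // old: Int product, may wrap to 0
    val expected = numWords * 8L         // what the map believes it has
    math.max(0L, expected - grantedBytes(requested))
  }

  def main(args: Array[String]): Unit = {
    val cap = 1L << 30
    println(s"Int math overrun:  ${overrunBytes(cap, useLongMath = false)} bytes")
    println(s"Long math overrun: ${overrunBytes(cap, useLongMath = true)} bytes")
  }
}
```

Under these assumptions, the `Int` path leaves almost the full 8 GiB unbacked by the granted page, which is the region later writes would run into; the `Long` path leaves none.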
    
    We observed the crash on ARM Graviton; this fix resolves it. The bug is a latent heap corruption regardless of architecture.
    
    Fix: use `Long` multiplication (`* 8L`) at all three sites so the product matches the declared `Long` types of the `allocatePage` argument and of `cursor`.
    
    ### Why are the changes needed?
    
    To fix a JVM SEGV in `LongToUnsafeRowMap` triggered when the page reaches the 8 GiB cap, observed on ARM Graviton.
    
    ### Does this PR introduce _any_ user-facing change?
    
    No.
    
    ### How was this patch tested?
    
    Existing `HashedRelationSuite` tests cover the affected paths. Validated on a downstream broadcast-hash-join build on ARM Graviton where the original SEGV reproduced; no crash with this fix applied.
    
    The suite that reproduces the crash is internal and hard to port to OSS, but the bug is evident from the code.
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Generated-by: Claude Code
    
    Closes #55929 from viirya/SPARK-54116-fix-int-overflow.
    
    Authored-by: Liang-Chi Hsieh <[email protected]>
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
    (cherry picked from commit bccbf2234a97e52830c3f6417806e0fe25a7c229)
    Signed-off-by: Liang-Chi Hsieh <[email protected]>
---
 .../apache/spark/sql/execution/joins/HashedRelation.scala | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
index 242185e80357..7712fdc9f6cc 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala
@@ -825,7 +825,13 @@ private[execution] final class LongToUnsafeRowMap(
         throw QueryExecutionErrors.cannotBuildHashedRelationLargerThan8GError()
       }
       val newNumWords = math.max(neededNumWords, math.min(page.size() / 8 * 2, 1 << 30))
-      val newPage = allocatePage(newNumWords.toInt * 8)
+      // newNumWords is a Long up to 1 << 30. Multiplying by 8 must stay in Long
+      // arithmetic; `newNumWords.toInt * 8` (Int * Int) overflows to 0 at the
+      // upper bound, causing `allocatePage(0)` to fall back to the default page
+      // size while subsequent writes still advance `cursor` past the new page's
+      // end (heap corruption observed as a `forward_copy_longs` SEGV during
+      // BHJ build on aarch64).
+      val newPage = allocatePage(newNumWords * 8L)
       Platform.copyMemory(page.getBaseObject, page.getBaseOffset, newPage.getBaseObject,
         newPage.getBaseOffset, usedBytes)
       freePage(page)
@@ -966,10 +972,13 @@ private[execution] final class LongToUnsafeRowMap(
     readData(readBuffer, array.memoryBlock.getBaseObject, array.memoryBlock.getBaseOffset, length)
     val pageLength = readLong().toInt
     freePage(page)
-    page = allocatePage(pageLength * 8)
+    // Use Long multiplication: pageLength can be up to 1 << 30 (8 GiB page / 8),
+    // and `Int * Int` overflows at that bound, leading to a 0-byte allocatePage
+    // and a subsequent cursor that runs past the page's end.
+    page = allocatePage(pageLength * 8L)
     readData(readBuffer, page.getBaseObject, page.getBaseOffset, pageLength)
     // Restore cursor variable to make this map able to be serialized again on executors.
-    cursor = pageLength * 8 + page.getBaseOffset
+    cursor = pageLength * 8L + page.getBaseOffset
   }
 
   override def readExternal(in: ObjectInput): Unit = {


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
