dingxin-tech opened a new issue, #9698:
URL: https://github.com/apache/incubator-gluten/issues/9698

   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   Hello team, I found that when an executor fails with an OOM error, it only prints the error log but does not exit the executor process, so the failed task is never retried automatically.
   I can reproduce this problem with a small off-heap memory setting. I am using Gluten 1.2.1, and upgrading is not easy for me. Is this a known problem? If so, could I cherry-pick a bugfix patch?
   
   ```
   25/05/20 15:12:00 ERROR ManagedReservationListener: Error reserving memory from target
   org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 341.3 KiB. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled). 
   Current config settings: 
        spark.gluten.memory.offHeap.size.in.bytes=1024.0 KiB
        spark.gluten.memory.task.offHeap.size.in.bytes=256.0 KiB
        spark.gluten.memory.conservative.task.offHeap.size.in.bytes=128.0 KiB
        spark.memory.offHeap.enabled=true
        spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
   Memory consumer stats: 
        Task.0:                             Current used bytes: 0.0 B, peak bytes:       N/A
        \- Gluten.Tree.3:                   Current used bytes: 0.0 B, peak bytes: 443.7 KiB
           \- root.3:                       Current used bytes: 0.0 B, peak bytes: 443.7 KiB
              +- OverAcquire.DummyTarget.0: Current used bytes: 0.0 B, peak bytes: 102.4 KiB
              \- RowToColumnar.3:           Current used bytes: 0.0 B, peak bytes: 341.3 KiB

        at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:105)
        at org.apache.gluten.memory.listener.ManagedReservationListener.reserve(ManagedReservationListener.java:43)
        at org.apache.gluten.exec.RuntimeJniWrapper.createRuntime(Native Method)
        at org.apache.gluten.exec.Runtime$RuntimeImpl.<init>(Runtime.scala:63)
        at org.apache.gluten.exec.Runtime$.apply(Runtime.scala:48)
        at org.apache.gluten.exec.Runtimes$.create(Runtimes.scala:33)
        at org.apache.gluten.exec.Runtimes$.$anonfun$contextInstance$1(Runtimes.scala:29)
        at org.apache.spark.util.TaskResourceRegistry.$anonfun$addResourceIfNotRegistered$1(TaskResources.scala:320)
        at org.apache.spark.util.TaskResourceRegistry.lock(TaskResources.scala:245)
        at org.apache.spark.util.TaskResourceRegistry.addResourceIfNotRegistered(TaskResources.scala:316)
        at org.apache.spark.util.TaskResources$.addResourceIfNotRegistered(TaskResources.scala:157)
        at org.apache.gluten.exec.Runtimes$.contextInstance(Runtimes.scala:29)
        at org.apache.gluten.execution.RowToVeloxColumnarExec$.toColumnarBatchIterator(RowToVeloxColumnarExec.scala:117)
        at org.apache.gluten.execution.RowToVeloxColumnarExec.$anonfun$doExecuteColumnarInternal$2(RowToVeloxColumnarExec.scala:72)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.gluten.execution.ColumnarInputRDDsWrapper.$anonfun$getIterators$1(WholeStageTransformer.scala:445)
        at scala.collection.immutable.List.flatMap(List.scala:366)
        at org.apache.gluten.execution.ColumnarInputRDDsWrapper.getIterators(WholeStageTransformer.scala:436)
        at org.apache.gluten.execution.WholeStageZippedPartitionsRDD.$anonfun$compute$1(WholeStageZippedPartitionsRDD.scala:48)
        at org.apache.gluten.utils.Arm$.withResource(Arm.scala:25)
        at org.apache.gluten.metrics.GlutenTimeMetric$.millis(GlutenTimeMetric.scala:37)
        at org.apache.gluten.execution.WholeStageZippedPartitionsRDD.compute(WholeStageZippedPartitionsRDD.scala:46)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
        at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
        at org.apache.spark.scheduler.Task.run(Task.scala:139)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1545)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:879)
   ```
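   My guess (not verified against the Gluten or Spark source, so please correct me) is that Spark only kills the executor JVM when it sees a fatal error such as `java.lang.OutOfMemoryError`, while Gluten's `ThrowOnOomMemoryTarget$OutOfMemoryException` is an ordinary exception, so the executor survives and nothing triggers the usual restart-and-retry. A minimal sketch of the distinction I mean; `shouldExitOnOom` is a hypothetical helper name, not a Spark or Gluten API:

   ```java
   // Sketch only: a check for OOM-like failures in a cause chain.
   public class OomExitCheck {

       /** Walk the cause chain looking for an OOM-like failure. */
       static boolean shouldExitOnOom(Throwable t) {
           for (Throwable c = t; c != null; c = c.getCause()) {
               // A java.lang.OutOfMemoryError is the fatal case Spark
               // already treats as grounds to exit the JVM.
               if (c instanceof OutOfMemoryError) {
                   return true;
               }
               // Gluten's OOM (ThrowOnOomMemoryTarget$OutOfMemoryException)
               // appears to be a plain exception, so only a name-based check
               // like this would catch it as well.
               if (c.getClass().getSimpleName().equals("OutOfMemoryException")) {
                   return true;
               }
           }
           return false;
       }

       public static void main(String[] args) {
           System.out.println(shouldExitOnOom(new RuntimeException(new OutOfMemoryError()))); // prints "true"
           System.out.println(shouldExitOnOom(new RuntimeException("plain failure")));        // prints "false"
       }
   }
   ```

   If this reading is right, one interim workaround might be to rethrow the Gluten OOM wrapped in a fatal error so Spark's existing handling kicks in, though I have not tried that.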
   
   
   
   ### Gluten version
   
   Gluten-1.2
   
   ### Spark version
   
   Spark-3.4.x
   
   ### Spark configurations
   
   spark.plugins=org.apache.gluten.GlutenPlugin
   spark.memory.offHeap.enabled=true
   spark.memory.offHeap.size=1m
   spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
   spark.gluten.sql.columnar.forceShuffledHashJoin=true
   spark.gluten.sql.native.writer.enabled=false
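
   For reference, these settings amount to roughly the following submit command (a sketch; the application jar is a placeholder, and only the `--conf` lines mirror the settings above):

   ```shell
   # my-app.jar is a placeholder for the actual application.
   spark-submit \
     --conf spark.plugins=org.apache.gluten.GlutenPlugin \
     --conf spark.memory.offHeap.enabled=true \
     --conf spark.memory.offHeap.size=1m \
     --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
     --conf spark.gluten.sql.columnar.forceShuffledHashJoin=true \
     --conf spark.gluten.sql.native.writer.enabled=false \
     my-app.jar
   ```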
   
   ### System information
   
   Velox System Info v0.0.2
   Commit: ea7315f851bc5fb5aaf7ce5a0e27a97df4a0ec31
   CMake Version: 3.14.5
   System: Linux-4.19.91-011.ali4000.alios7.x86_64
   Arch: x86_64
   CPU Name: Model name:            Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
   C++ Compiler: /usr/local/alicpp/built/gcc-9.2.1/gcc-9.2.1/bin/g++
   C++ Compiler Version: 9.2.1
   C Compiler: /usr/local/alicpp/built/gcc-9.2.1/gcc-9.2.1/bin/gcc
   C Compiler Version: 9.2.1
   CMake Prefix Path: 
/usr/local;/usr;/;/usr/local;/usr/local;/usr/X11R6;/usr/pkg;/opt
   
   ### Relevant logs
   
   See the stack trace quoted in the bug description above.

