[I] [VL] HashShuffleWriter OOM when the schema contains a column of complex type in large scale job [incubator-gluten]

via GitHub Thu, 28 Nov 2024 15:51:14 -0800


kecookier opened a new issue, #8088:
URL: https://github.com/apache/incubator-gluten/issues/8088


   ### Backend
   
   VL (Velox)
   
   ### Bug description
   
   The `default_leaf` pool (a `velox::memory::MemoryPool`) under `ShuffleWriter` 
allocates too much memory in `VeloxHashShuffleWriter` (8.2 GiB in the dump 
below), causing an off-heap OOM.
   ```
   24/11/26 21:31:42 ERROR Executor task launch worker for task 1559 ManagedReservationListener: Error reserving memory from target
   
   org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget$OutOfMemoryException: Not enough spark off-heap execution memory. Acquired: 8.0 MiB, granted: 0.0 B. Try tweaking config option spark.memory.offHeap.size to get larger space to run this application (if spark.gluten.memory.dynamic.offHeap.sizing.enabled is not enabled). 
   
   Current config settings: 
        spark.gluten.memory.offHeap.size.in.bytes=13690208256
        spark.gluten.memory.task.offHeap.size.in.bytes=6845104128
        spark.gluten.memory.conservative.task.offHeap.size.in.bytes=3422552064
        spark.gluten.memory.dynamic.offHeap.sizing.enabled=false
   Memory consumer stats: 
        Task.1559:                                             Current used bytes:   8.4 GiB, peak bytes:        N/A
        \- Gluten.Tree.0:                                      Current used bytes:   8.4 GiB, peak bytes:   11.9 GiB
           \- root.0:                                          Current used bytes:   8.4 GiB, peak bytes:   11.9 GiB
              +- ShuffleWriter.0:                              Current used bytes:   8.3 GiB, peak bytes:    8.8 GiB
              |  \- single:                                    Current used bytes:   8.3 GiB, peak bytes:    8.8 GiB
              |     +- root:                                   Current used bytes:   8.2 GiB, peak bytes:    8.2 GiB
              |     |  \- default_leaf:                        Current used bytes:   8.2 GiB, peak bytes:    8.2 GiB
              |     \- gluten::MemoryAllocator:                Current used bytes:  62.9 MiB, peak bytes: 1436.4 MiB
              +- VeloxBatchAppender.0:                         Current used bytes: 104.0 MiB, peak bytes:  224.0 MiB
              |  \- single:                                    Current used bytes: 104.0 MiB, peak bytes:  224.0 MiB
              |     +- root:                                   Current used bytes: 100.2 MiB, peak bytes:  224.0 MiB
              |     |  \- default_leaf:                        Current used bytes: 100.2 MiB, peak bytes:  216.8 MiB
              |     \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B
              +- NativePlanEvaluator-1.0:                      Current used bytes:  25.0 MiB, peak bytes:  176.0 MiB
              |  \- single:                                    Current used bytes:  25.0 MiB, peak bytes:  176.0 MiB
              |     +- root:                                   Current used bytes:  22.6 MiB, peak bytes:  169.0 MiB
              |     |  +- task.Gluten_Stage_2_TID_1559_VTID_0: Current used bytes:  22.6 MiB, peak bytes:  169.0 MiB
              |     |  |  +- node.0:                           Current used bytes:  22.1 MiB, peak bytes:  168.0 MiB
              |     |  |  |  +- op.0.0.0.TableScan:            Current used bytes:  22.1 MiB, peak bytes:  162.8 MiB
              |     |  |  |  \- op.0.0.0.TableScan.test-hive:  Current used bytes:     0.0 B, peak bytes:      0.0 B
              |     |  |  \- node.1:                           Current used bytes: 528.2 KiB, peak bytes: 1024.0 KiB
              |     |  |     \- op.1.0.0.FilterProject:        Current used bytes: 528.2 KiB, peak bytes:  849.5 KiB
              |     |  \- default_leaf:                        Current used bytes:     0.0 B, peak bytes:      0.0 B
              |     \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B
              +- ArrowContextInstance.0:                       Current used bytes:     0.0 B, peak bytes:      0.0 B
              +- VeloxBatchAppender.0.OverAcquire.0:           Current used bytes:     0.0 B, peak bytes:   67.2 MiB
              +- IndicatorVectorBase#init.0.OverAcquire.0:     Current used bytes:     0.0 B, peak bytes:    2.4 MiB
              +- NativePlanEvaluator-1.0.OverAcquire.0:        Current used bytes:     0.0 B, peak bytes:   52.8 MiB
              +- ShuffleWriter.0.OverAcquire.0:                Current used bytes:     0.0 B, peak bytes:    2.6 GiB
              \- IndicatorVectorBase#init.0:                   Current used bytes:     0.0 B, peak bytes:    8.0 MiB
                 \- single:                                    Current used bytes:     0.0 B, peak bytes:    8.0 MiB
                    +- root:                                   Current used bytes:     0.0 B, peak bytes:      0.0 B
                    |  \- default_leaf:                        Current used bytes:     0.0 B, peak bytes:      0.0 B
                    \- gluten::MemoryAllocator:                Current used bytes:     0.0 B, peak bytes:      0.0 B
   
   
        at org.apache.gluten.memory.memtarget.ThrowOnOomMemoryTarget.borrow(ThrowOnOomMemoryTarget.java:66)
        at org.apache.gluten.memory.listener.ManagedReservationListener.reserve(ManagedReservationListener.java:49)
        at org.apache.gluten.vectorized.ShuffleWriterJniWrapper.write(Native Method)
        at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:177)
        at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:231)
        at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
        at org.apache.spark.scheduler.Task.run(Task.scala:134)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:479)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1448)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:482)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   ```
   
   ### Where is VeloxMemoryPool used in VeloxHashShuffleWriter?
   When `splitComplexType()` is called, the vector will first be serialized by 
`PrestoVectorSerde`, and then flushed to cache by the function 
`evictPartitionBuffers()`. The memory held by `arenas_` will be freed only 
after flushing.
   
   ### Why is so much memory used?
   When `doSplit` is called, we estimate how many rows can fit within the 
current task's available memory, and then adjust the existing partition 
buffers. The estimate covers only simple columns, not complex type columns, so 
the memory held by complex type data is not counted. As we iterate batch by 
batch, we check whether the current estimated row count is much larger than the 
size of the existing partition buffers. If so, we cache these buffers (evict 
the partition buffers to payloadCache); the cached payload is spilled later, 
which frees the memory. If the complex type vectors are large, this eviction is 
typically not triggered until the task has already run out of memory (OOM).
   
   ### Possible Solutions
   1. The default partition buffer size is 4096. In our case, the schema is 
`{int, string, map<string, string>, map<string, string>}`, and the process runs 
out of memory after iterating over 200+ batches. If we change this option to 
200, the job succeeds, but this is not a general solution.
   2. When estimating how many rows can fit within the current task's available 
memory, also consider complex type columns. We can use `arenas_` to do this.
   
   ### Spark version
   
   None
   
   ### Spark configurations
   
   _No response_
   
   ### System information
   
   _No response_
   
   ### Relevant logs
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

