xumanbu opened a new issue, #2294:
URL: https://github.com/apache/incubator-uniffle/issues/2294

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the 
[issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Describe the bug
   
   
   A task failed because it could not obtain executor memory, yet the bytes allocated to the task were 0. This happened after the executor had already run many tasks.
   
   The error below is unexpected, since `spark.executor.cores` is set to 1, so no other concurrent task on the executor should be holding the memory.
   ```
   org.apache.uniffle.common.exception.RssException: 
   Can't get memory to cache shuffle data,
   request[16777216], got[0], WriteBufferManager allocated[0] task used[0]. It 
may be caused by shuffle server is full of data or consider to optimize 
'spark.executor.memory', 'spark.rss.writer.buffer.spill.size'.
   ```
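   For context, the engine log below shows the request being retried with a sleep roughly every 3 seconds ("sleep and try[N] again") for about an hour (up to try[1200]) before the `RssException` is thrown. The sketch below is a hypothetical simplification of that observable retry behavior, not Uniffle's actual `WriteBufferManager` code; the `Allocator` interface and all names are invented for illustration.
   
   ```java
   /**
    * Simplified sketch of the retry behavior visible in the log:
    * request memory, log and sleep when too little is granted,
    * and fail after the retry limit is exhausted.
    */
   public class MemoryRetrySketch {
       /** Hypothetical stand-in for the executor memory manager. */
       interface Allocator {
           long acquire(long requested);
       }
   
       static long requestMemory(Allocator alloc, long requested, long minAcquired,
                                 int maxRetries, long sleepMs) {
           long got = 0;
           for (int attempt = 0; attempt <= maxRetries; attempt++) {
               got += alloc.acquire(requested - got);
               if (got >= minAcquired) {
                   return got; // enough memory granted, proceed
               }
               // Mirrors: "Can't get memory for now, sleep and try[N] again, ..."
               System.out.printf("Can't get memory for now, sleep and try[%d] again, "
                   + "request[%d], got[%d] less than %d%n",
                   attempt, requested, got, minAcquired);
               try {
                   Thread.sleep(sleepMs);
               } catch (InterruptedException ie) {
                   Thread.currentThread().interrupt();
                   break;
               }
           }
           // Mirrors the final ERROR line and the thrown RssException.
           throw new RuntimeException("Can't get memory to cache shuffle data, "
               + "request[" + requested + "], got[" + got + "]");
       }
   
       public static void main(String[] args) {
           // An allocator that never grants memory reproduces the failure
           // pattern from the log (got[0] on every attempt, then an exception).
           Allocator empty = req -> 0L;
           try {
               requestMemory(empty, 16777216L, 193484L, 3, 1L);
           } catch (RuntimeException e) {
               System.out.println("FAILED: " + e.getMessage());
           }
       }
   }
   ```
   
   In this report, `got` stays 0 on every attempt even though the task has no other memory in use, which is why the allocation of 0 bytes looks like a bug rather than ordinary memory pressure.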
   
   ### Affects Version(s)
   
   0.10
   
   ### Uniffle Server Log Output
   
   _No response_
   
   ### Uniffle Engine Log Output
   
   ```log
   24/12/13 14:05:46 INFO [Executor task launch worker for task 19059] 
ComposedClientReadHandler: Client read 157 blocks from 
[ShuffleServerInfo{host[10.xx], grpc port[xx], netty port[xx]}], Consumed[ 
hot:0 warm:157 cold:0 frozen:0 ], Skipped[ hot:0 warm:0 cold:0 frozen:0 ]
   24/12/13 14:05:46 INFO [Executor task launch worker for task 19059] 
ComposedClientReadHandler: Client read 699687660 bytes from 
[ShuffleServerInfo{host[10.xxx], grc port[xx], netty port[xx]}], Consumed[ 
hot:0 warm:699687660 cold:0 frozen:0 ], Skipped[ hot:0 warm:0 cold:0 frozen:0 ]
   24/12/13 14:05:46 INFO [Executor task launch worker for task 19059] 
ComposedClientReadHandler: Client read 2161016354 uncompressed bytes from 
[ShuffleServerInfo{host[10.xx], grpc port[xx], netty port[49999]}], Consumed[ 
hot:0 warm:2161016354 cold:0 frozen:0 ], Skipped[ hot:0 warm:0 cold:0 frozen:0 ]
   24/12/13 14:05:46 INFO [Executor task launch worker for task 19059] 
RssShuffleDataIterator: Fetch 699687660 bytes cost 26210 ms and 4 ms to 
serialize, 568 ms to decompress with unCompressionLength[2161016354]
   24/12/13 14:05:46 INFO [Executor task launch worker for task 19059] 
WriteBufferManager: Can't get memory for now, sleep and try[0] again, 
request[16777216], got[0] less than 193484
   24/12/13 14:05:49 INFO [Executor task launch worker for task 19059] 
WriteBufferManager: Can't get memory for now, sleep and try[1] again, 
request[16777216], got[0] less than 193484
   24/12/13 14:05:52 INFO [Executor task launch worker for task 19059] 
WriteBufferManager: Can't get memory for now, sleep and try[2] again, 
request[16777216], got[0] less than 193484
   24/12/13 14:05:55 INFO [Executor task launch worker for task 19059] 
WriteBufferManager: Can't get memory for now, sleep and try[3] again, 
request[16777216], got[0] less than 193484
   .......
   24/12/13 15:05:45 INFO [Executor task launch worker for task 19059] 
WriteBufferManager: Can't get memory for now, sleep and try[1199] again, 
request[16777216], got[0] less than 193484
   24/12/13 15:05:48 INFO [Executor task launch worker for task 19059] 
WriteBufferManager: Can't get memory for now, sleep and try[1200] again, 
request[16777216], got[0] less than 193484
   24/12/13 15:05:51 ERROR [Executor task launch worker for task 19059] 
WriteBufferManager: Can't get memory to cache shuffle data, request[16777216], 
got[0], WriteBufferManager allocated[0] task used[0]. It may be caused by 
shuffle server is full of data or consider to optimize 'spark.executor.memory', 
'spark.rss.writer.buffer.spill.size'.
   24/12/13 15:05:51 ERROR [Executor task launch worker for task 19059] 
Executor: Exception in task 18962.0 in stage 3.0 (TID 19059)
   org.apache.uniffle.common.exception.RssException: Can't get memory to cache 
shuffle data, request[16777216], got[0], WriteBufferManager allocated[0] task 
used[0]. It may be caused by shuffle server is full of data or consider to 
optimize 'spark.executor.memory', 'spark.rss.writer.buffer.spill.size'.
        at 
org.apache.spark.shuffle.writer.WriteBufferManager.requestExecutorMemory(WriteBufferManager.java:462)
        at 
org.apache.spark.shuffle.writer.WriteBufferManager.requestMemory(WriteBufferManager.java:420)
        at 
org.apache.spark.shuffle.writer.WriteBufferManager.insertIntoBuffer(WriteBufferManager.java:265)
        at 
org.apache.spark.shuffle.writer.WriteBufferManager.addPartitionData(WriteBufferManager.java:205)
        at 
org.apache.spark.shuffle.writer.WriteBufferManager.addRecord(WriteBufferManager.java:317)
        at 
org.apache.spark.shuffle.writer.RssShuffleWriter.writeImpl(RssShuffleWriter.java:303)
        at 
org.apache.spark.shuffle.writer.RssShuffleWriter.write(RssShuffleWriter.java:272)
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
        at org.apache.spark.scheduler.Task.run(Task.scala:135)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$runWithUgi$3(Executor.scala:462)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1381)
        at 
org.apache.spark.executor.Executor$TaskRunner.runWithUgi(Executor.scala:465)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:394)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
   ```
   
   
   ### Uniffle Server Configurations
   
   _No response_
   
   ### Uniffle Engine Configurations
   
   _No response_
   
   ### Additional context
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

