rickyma commented on PR #1521:
URL: 
https://github.com/apache/incubator-uniffle/pull/1521#issuecomment-1942153288

   When stress testing the shuffle server without this PR, we easily run into 
`OutOfDirectMemoryError`, which shows that this PR is necessary.
   
   [epollEventLoopGroup-3-45] [WARN] TransportChannelHandler.exceptionCaught - Exception in connection from /127.0.0.1:58767
   io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 161061273600, max: 161061273600)
           at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:843)
           at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:772)
           at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:710)
           at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:685)
           at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:212)
           at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:194)
           at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
           at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
           at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:397)
           at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
           at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
           at org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50)
           at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decodePartitionData(SendShuffleDataRequest.java:95)
           at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decode(SendShuffleDataRequest.java:107)
           at org.apache.uniffle.common.netty.protocol.Message.decode(Message.java:145)
           at org.apache.uniffle.common.netty.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:77)
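
   To put the two figures from the error message in perspective, here is a 
small conversion sketch. The class and method names are only illustrative; the 
byte values are copied from the log line above.

```java
// Converts the byte counts from the OutOfDirectMemoryError message into
// binary units. Illustrative helper only, not part of Uniffle or Netty.
final class DirectMemoryFigures {
    static final long MIB = 1024L * 1024L;
    static final long GIB = 1024L * MIB;

    static long toMiB(long bytes) {
        return bytes / MIB;
    }

    static long toGiB(long bytes) {
        return bytes / GIB;
    }

    public static void main(String[] args) {
        long requested = 4_194_304L;   // "failed to allocate 4194304 byte(s)"
        long max = 161_061_273_600L;   // "max: 161061273600"
        System.out.println(toMiB(requested) + " MiB requested"); // prints "4 MiB requested"
        System.out.println(toGiB(max) + " GiB limit");           // prints "150 GiB limit"
    }
}
```

   In other words, the whole 150 GiB direct memory limit was already in use 
when a further 4 MiB allocation was attempted.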
   
   Every time an out-of-direct-memory error occurs, the stack trace points to 
`org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50)`.
 This allocation is the most direct trigger for running out of direct memory.
   
   When a large number of requests arrive simultaneously, there can be a brief 
period (before the `TransportFrameDecoder` has a chance to call `release`) 
during which the shuffle server holds twice as many `ByteBuf`s as it otherwise 
would. For that short time direct memory usage is doubled, which is very hard 
to control.
   That is why it is so easy to hit an out-of-direct-memory error without 
this PR.
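
   The doubling described above can be sketched with a toy accounting model. 
This is only an illustration under assumptions: `DirectMemoryModel` and 
`peakBytes` are hypothetical names, the model counts bytes without using any 
Netty or Uniffle classes, and it assumes the worst case in which every request 
is decoded before any frame is released. The `copyOnDecode = false` case is 
included purely for comparison and is not meant to describe how this PR is 
implemented.

```java
// Toy accounting model: it tracks byte counts only and uses no Netty or
// Uniffle classes. All names here are hypothetical.
final class DirectMemoryModel {
    private long used; // bytes currently allocated
    private long peak; // highest value `used` has reached

    private void allocate(long bytes) {
        used += bytes;
        peak = Math.max(peak, used);
    }

    private void release(long bytes) {
        used -= bytes;
    }

    // Peak usage for `requests` frames of `frameBytes` each, assuming all of
    // them are decoded before any frame is released (worst case).
    static long peakBytes(int requests, long frameBytes, boolean copyOnDecode) {
        DirectMemoryModel m = new DirectMemoryModel();
        for (int i = 0; i < requests; i++) {
            m.allocate(frameBytes);      // incoming frame buffer
        }
        if (copyOnDecode) {
            for (int i = 0; i < requests; i++) {
                m.allocate(frameBytes);  // second buffer created while decoding
            }
        }
        for (int i = 0; i < requests; i++) {
            m.release(frameBytes);       // frames are released only now
        }
        return m.peak;
    }

    public static void main(String[] args) {
        long frame = 4L * 1024 * 1024; // 4 MiB per request, chosen for illustration
        System.out.println(peakBytes(1000, frame, true));  // prints 8388608000
        System.out.println(peakBytes(1000, frame, false)); // prints 4194304000
    }
}
```

   With 1000 concurrent 4 MiB requests, the modeled peak is twice the payload 
size whenever decoding allocates a second buffer before the first one is 
released.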
   
   So we need this PR regardless. It might slow down the flushing process a 
little (although my tests do not seem to show any impact on performance), but 
the shuffle server will at least remain available during the whole stress 
test. 
   Maybe we should prioritize availability first, and consider deeper 
performance optimization later on?
   
   WDYT? @jerqi 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

