[ https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736340#comment-17736340 ]

Duo Zhang commented on HBASE-27947:
-----------------------------------

Netty has a way to guess the incoming message size so it can allocate a ByteBuf 
which is large enough.

You can see the code around RecvByteBufAllocator.
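
For example, something like this (just a rough sketch, not HBase code; the 
sizes are made up, and inside HBase the netty classes are relocated under 
org.apache.hbase.thirdparty):

{code:java}
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.AdaptiveRecvByteBufAllocator;
import io.netty.channel.ChannelOption;

public class RecvBufSketch {
    public static void main(String[] args) {
        // AdaptiveRecvByteBufAllocator tracks how many bytes each read()
        // actually returned and adjusts the next buffer guess between the
        // given bounds. The 512 / 8192 / 65536 values are illustrative only.
        ServerBootstrap bootstrap = new ServerBootstrap();
        bootstrap.childOption(ChannelOption.RCVBUF_ALLOCATOR,
            new AdaptiveRecvByteBufAllocator(512, 8192, 65536));
    }
}
{code}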

IIRC in Spark they used to observe a problem where, when receiving large 
messages over a network that is not so good, netty would waste a lot of space, 
as the allocated ByteBuf is large but each read only fills a small amount of 
it. And if you use COMPOSITE_CUMULATOR instead of MERGE_CUMULATOR, things get 
worse, as you keep all of those ByteBufs cached...
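
For reference, a minimal sketch (again not HBase code, just a toy decoder) of 
switching between the two cumulators:

{code:java}
import java.util.List;

import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import io.netty.handler.codec.ByteToMessageDecoder;

public class CumulatorSketch {

    // Toy decoder that simply forwards whatever bytes have accumulated so far.
    static class PassThroughDecoder extends ByteToMessageDecoder {
        @Override
        protected void decode(ChannelHandlerContext ctx, ByteBuf in, List<Object> out) {
            out.add(in.readRetainedSlice(in.readableBytes()));
        }
    }

    public static void main(String[] args) {
        PassThroughDecoder decoder = new PassThroughDecoder();
        // MERGE_CUMULATOR (the default) copies newly arrived bytes into one
        // buffer, expanding it as needed; COMPOSITE_CUMULATOR keeps every
        // received ByteBuf and stitches them into a CompositeByteBuf instead.
        decoder.setCumulator(ByteToMessageDecoder.COMPOSITE_CUMULATOR);
    }
}
{code}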

SslHandler just extends ByteToMessageDecoder, which uses MERGE_CUMULATOR by 
default, but in SslHandler a different SSLEngine will lead to a different 
cumulator implementation. You can see the code around SslEngineType for more 
details.
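
The engine flavor itself is chosen when the SslContext is built, e.g. (a rough 
sketch; OPENSSL needs netty-tcnative on the classpath, and the cumulator choice 
happens internally, it is not something you configure directly):

{code:java}
import javax.net.ssl.SSLException;

import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;
import io.netty.handler.ssl.SslProvider;

public class SslProviderSketch {
    public static void main(String[] args) throws SSLException {
        // Client context just for illustration; forServer(...) works the same.
        // JDK vs OPENSSL selects the SSLEngine implementation, and SslHandler
        // picks its cumulator strategy based on that engine type.
        SslContext ctx = SslContextBuilder.forClient()
            .sslProvider(SslProvider.OPENSSL) // or SslProvider.JDK
            .build();
        System.out.println(ctx.getClass().getSimpleName());
    }
}
{code}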

Receiving large messages is always challenging, especially if you want to 
control the memory usage...

Thanks.

> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
>                 Key: HBASE-27947
>                 URL: https://issues.apache.org/jira/browse/HBASE-27947
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Bryan Beaudreault
>            Priority: Critical
>
> We are rolling out the server side TLS settings to all of our QA clusters. 
> This has mostly gone fine, except on 1 cluster. Most clusters, including this 
> one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This 
> cluster tends to get bursts of traffic, in which case it would typically jump 
> to 400-500mb. Again this is sampled, so it could have been higher than that. 
> When we enabled SSL on this cluster, we started seeing bursts up to at least 
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs and 
> general chaos on the cluster.
>  
> We've gotten it under control a little bit by setting 
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and 
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}. 
> We've set netty's maxDirectMemory to be approx equal to 
> ({{-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize}}). Now we 
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain 
> for clients but at least insulates the other components of the regionserver.
>  
> We're still digging into exactly why this is happening. The cluster clearly 
> has a bad access pattern, but it doesn't seem like SSL should increase the 
> memory footprint by 5-10x like we're seeing.



