[ https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736435#comment-17736435 ]

Duo Zhang commented on HBASE-27947:
-----------------------------------

Oh, it seems I misunderstood the description. I thought there were large 
writes (1MB+ rows) hitting the cluster periodically...

If you can see that the memory spike always coincides with the multiget 
request spike, then the problem is probably not about receiving and 
accumulating request data.

When writing data back, the SslHandler needs to split the data into 16KB 
chunks, so a single connection is unlikely to hold an out buffer much larger 
than 16KB. Of course, if this slows down processing, we may buffer a lot of 
cells in memory, but those cells live in our own memory pool, not netty's, so 
if you can see netty's OOME, there must be another problem.
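
For illustration, a minimal sketch of that chunking (not the actual 
SslHandler code; it uses plain io.netty classes rather than the 
hbase-thirdparty shaded ones, and the class name is made up):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;

    public class TlsChunkingSketch {
        // TLS caps plaintext at 2^14 bytes per record (RFC 8446), which is
        // why SslHandler wraps outbound data in roughly 16KB pieces.
        static final int MAX_PLAINTEXT = 16 * 1024;

        public static void main(String[] args) {
            // Pretend this is a 1MB multiget response queued on one connection.
            ByteBuf response = Unpooled.buffer(1_000_000);
            response.writerIndex(response.capacity());
            int records = 0;
            while (response.isReadable()) {
                int len = Math.min(MAX_PLAINTEXT, response.readableBytes());
                // Roughly what SslHandler hands to SSLEngine.wrap() each pass.
                ByteBuf chunk = response.readRetainedSlice(len);
                records++;
                chunk.release();
            }
            response.release();
            System.out.println("1MB response -> " + records + " TLS records"); // ~62
        }
    }

So even a large response is wrapped and flushed record by record, and the 
per-connection TLS wrap buffer stays on the order of a single record.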

And the max memory is not 50MB * 80: the rpc handler will not block on 
sending data back, as we use non-blocking sockets here. The handler will 
begin to process other requests right after adding the ByteBuf to a 
connection's output buffer. So the max memory is more likely bounded by the 
number of connections, not the number of rpc handlers.
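
A hedged sketch of that non-blocking send path (not the actual HBase RPC 
code; the class and method names here are made up for illustration):

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.Channel;
    import io.netty.channel.ChannelFuture;

    public final class NonBlockingSendSketch {
        // Hypothetical helper: queue a response on one connection's channel.
        static void sendResponse(Channel conn, ByteBuf response) {
            // writeAndFlush() does not block the handler thread: it appends
            // the ByteBuf to the channel's outbound buffer, and the event
            // loop drains it as the socket becomes writable.
            ChannelFuture f = conn.writeAndFlush(response);
            // The handler is free to pick up the next request here. If a
            // client reads slowly, unsent bytes pile up per connection --
            // which is why total memory tracks connections, not handlers.
            if (!conn.isWritable()) {
                // Outbound buffer is past the high watermark; a real server
                // might throttle or stop reading from this connection.
            }
            f.addListener(future -> {
                if (!future.isSuccess()) {
                    // Write failed (e.g. connection closed); log and clean up.
                }
            });
        }
    }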

Thanks.

> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
>                 Key: HBASE-27947
>                 URL: https://issues.apache.org/jira/browse/HBASE-27947
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 2.6.0
>            Reporter: Bryan Beaudreault
>            Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters. 
> This has mostly gone fine, except on one cluster. Most clusters, including 
> this one, have a sampled {{nettyDirectMemory}} usage of about 30-100MB. This 
> cluster tends to get bursts of traffic, in which case it would typically jump 
> to 400-500MB. Again, this is sampled, so it could have been higher than that. 
> When we enabled SSL on this cluster, we started seeing bursts up to at least 
> 4GB. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs and 
> general chaos on the cluster.
>  
> We've gotten it under control a little bit by setting 
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and 
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}. 
> We've set netty's maxDirectMemory to be approximately equal to 
> ({{-XX:MaxDirectMemorySize}} - BucketCacheSize - ReservoirSize). Now we are 
> seeing netty's own OutOfDirectMemoryError, which is still causing pain for 
> clients but at least insulates the other components of the regionserver.
>  
> We're still digging into exactly why this is happening. The cluster clearly 
> has a bad access pattern, but it doesn't seem like SSL should increase the 
> memory footprint by 5-10x like we're seeing.
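
For reference, a sketch of the sizing described in the report above, with 
illustrative numbers (the heap and cache values are assumptions, not from 
this issue):

    -XX:MaxDirectMemorySize=6g
    (assume BucketCacheSize = 3g and ReservoirSize = 1g, so netty gets ~2g)
    -Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory=2147483648
    -Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible=true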



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
