[
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738186#comment-17738186
]
Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------
Thanks, will do. Can you elaborate on this part?
{quote}even if you block the handler, the output buffer is still there...
{quote}
The blocking occurs prior to calling {{channel.writeAndFlush(this)}}. I
realize the handler is still holding resources for the response in our own
hbase ByteBuffAllocator. That shouldn't be a huge problem, since the handler
can't do any more work while it's blocked. The big issue today is that a
handler can produce a large buffer and then immediately start doing more work,
which might create other large buffers. Blocking stops that, at the expense of
response times.
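To make the idea concrete, here's a minimal sketch (not the actual patch; the
polling is purely illustrative): the handler thread waits for the channel to
become writable before handing the response to netty, so a saturated socket
throttles the handler instead of letting responses pile up off-heap.
{code:java}
import org.apache.hbase.thirdparty.io.netty.channel.Channel;

// Illustrative sketch only: block the handler until netty has drained the
// outbound buffer below the low water mark (isWritable() == true), then queue
// the response. A real implementation would wait on channelWritabilityChanged()
// rather than polling.
final class BlockingResponseWriter {
  static void writeBlocking(Channel channel, Object response) throws InterruptedException {
    while (channel.isActive() && !channel.isWritable()) {
      Thread.sleep(1); // handler is parked here while the socket is saturated
    }
    channel.writeAndFlush(response);
  }
}
{code}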
The ChannelOutboundBuffer is still there and still has contents which netty is
draining into the socket. Blocking prior to calling writeAndFlush gives netty
time to drain the ChannelOutboundBuffer before accepting more.
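For reference, the writability check in the sketch above is driven by the
channel's write buffer water marks, which bound how many bytes can sit in the
ChannelOutboundBuffer before the channel reports itself unwritable. The values
below are made-up examples, not HBase defaults:
{code:java}
import org.apache.hbase.thirdparty.io.netty.bootstrap.ServerBootstrap;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelOption;
import org.apache.hbase.thirdparty.io.netty.channel.WriteBufferWaterMark;

final class WaterMarkConfig {
  // Once pending outbound bytes exceed the high water mark (2 MiB here),
  // isWritable() returns false until netty drains back below the low mark (1 MiB).
  static ServerBootstrap configure(ServerBootstrap bootstrap) {
    return bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(1024 * 1024, 2 * 1024 * 1024));
  }
}
{code}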
To me this is sort of similar to the natural handler backpressure that occurs
when, for example, the disk is slow. When disk is fast, a handler might be
active for a few millis or less per request. When disk is saturated, it might
be active for much longer per request. The handler is blocked during that
time, but it is currently never blocked when the output socket is saturated.
> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
> Key: HBASE-27947
> URL: https://issues.apache.org/jira/browse/HBASE-27947
> Project: HBase
> Issue Type: Bug
> Components: rpc
> Affects Versions: 2.6.0
> Reporter: Bryan Beaudreault
> Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters.
> This has mostly gone fine, except on one cluster. Most clusters, including this
> one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This
> cluster tends to get bursts of traffic, in which case it would typically jump
> to 400-500mb. Again, this is sampled, so it could have been higher than that.
> When we enabled SSL on this cluster, we started seeing bursts up to at least
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs
> and general chaos on the cluster.
>
> We've gotten it under control a little bit by setting
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}.
> We've set netty's maxDirectMemory to be approximately equal to
> ({{-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize}}). Now we
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain
> for clients but at least insulates the other components of the regionserver.
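>
> To put rough numbers on that sizing (illustrative values only, not our actual
> ones): with {{-XX:MaxDirectMemorySize=8g}}, a 4g bucket cache, and a 1g
> reservoir, netty would be capped at roughly 3g, e.g.:
> {code}
> -XX:MaxDirectMemorySize=8g
> -Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory=3221225472
> -Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible=true
> {code}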
>
> We're still digging into exactly why this is happening. The cluster clearly
> has a bad access pattern, but it doesn't seem like SSL should increase the
> memory footprint by 5-10x like we're seeing.