[ https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738150#comment-17738150 ]

Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------

I implemented a pretty rudimentary POC for applying backpressure. Basically, I 
added a {{channelWritabilityChanged}} handler in NettyRpcFrameDecoder. When the 
writability changes, it sets an AtomicBoolean on the NettyServerRpcConnection 
and calls {{notifyAll()}} if the channel has become writable. Then in 
NettyServerCall.sendResponseIfReady, we check whether the channel is writable 
and, if not, we {{wait()}} and poll the AtomicBoolean until it is true. 
sendResponseIfReady is called from a number of places, including within the 
event loop. We don't want to block the event loop, so I added a {{boolean 
canBlock}} argument to the method, which is only true when called from 
CallRunner.run().
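
A rough sketch of the shape of it, using plain Netty types and illustrative 
names (the real patch touches NettyRpcFrameDecoder / NettyServerRpcConnection / 
NettyServerCall, and HBase actually uses the shaded 
org.apache.hbase.thirdparty.io.netty packages):

{code:java}
import java.util.concurrent.atomic.AtomicBoolean;

import io.netty.channel.Channel;
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

public class WritabilityBackpressure {

  /** Per-connection flag shared between the inbound handler and response writers. */
  static final class ConnectionState {
    final AtomicBoolean writable = new AtomicBoolean(true);
  }

  /** Mirrors the channel's writability into the flag and wakes blocked writers. */
  static final class WritabilityHandler extends ChannelInboundHandlerAdapter {
    private final ConnectionState state;

    WritabilityHandler(ConnectionState state) {
      this.state = state;
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
      synchronized (state) {
        state.writable.set(ctx.channel().isWritable());
        if (state.writable.get()) {
          // Wake any response writers parked in sendResponse(..., canBlock=true).
          state.notifyAll();
        }
      }
      super.channelWritabilityChanged(ctx);
    }
  }

  /**
   * Called when a response is ready. canBlock must be false on event loop
   * threads (blocking there would prevent the wakeup above from ever firing),
   * and is only true when called from a handler thread like CallRunner.run().
   */
  static void sendResponse(Channel ch, Object response, ConnectionState state,
      boolean canBlock) throws InterruptedException {
    if (canBlock) {
      synchronized (state) {
        // Poll the flag under the monitor; wait() releases it until notifyAll().
        while (!state.writable.get()) {
          state.wait();
        }
      }
    }
    ch.writeAndFlush(response);
  }
}
{code}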

This resolved the OOMs for my test case, and the total throughput achievable 
by the test is now much closer to what we get when running through haproxy. 
I'm currently seeing about a 5% throughput reduction compared to haproxy, but 
I'm not yet sure whether that's just variance/noise, since the test case is so 
extreme. This is with tcnative/boringSSL.

 

> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
>                 Key: HBASE-27947
>                 URL: https://issues.apache.org/jira/browse/HBASE-27947
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 2.6.0
>            Reporter: Bryan Beaudreault
>            Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters. 
> This has mostly gone fine, except on one cluster. Most clusters, including 
> this one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This 
> cluster tends to get bursts of traffic, in which case it would typically jump 
> to 400-500mb. Again, this is sampled, so it could have been higher than that. 
> When we enabled SSL on this cluster, we started seeing bursts up to at least 
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs 
> and general chaos on the cluster.
>  
> We've gotten it under control a little bit by setting 
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and 
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}. 
> We've set netty's maxDirectMemory to be approximately equal to 
> ({{-XX:MaxDirectMemorySize}} - BucketCacheSize - ReservoirSize). Now we 
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain 
> for clients, but at least it insulates the other components of the 
> regionserver.
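>  
> For illustration, the resulting JVM arguments end up looking roughly like 
> this (the sizes here are hypothetical, not our actual values):
> {code}
> # e.g. with an 8g direct memory cap, a 4g bucket cache and a 1g reservoir,
> # netty gets roughly 8g - 4g - 1g = 3g (netty's flag takes bytes):
> -XX:MaxDirectMemorySize=8g
> -Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory=3221225472
> -Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible=true
> {code}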
>  
> We're still digging into exactly why this is happening. The cluster clearly 
> has a bad access pattern, but it doesn't seem like SSL should increase the 
> memory footprint by the 5-10x we're seeing.


