[jira] [Updated] (HBASE-27947) RegionServer OOM under load when TLS is enabled

Bryan Beaudreault (Jira) Sat, 12 Aug 2023 13:14:04 -0700


     [ 
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bryan Beaudreault updated HBASE-27947:
--------------------------------------
    Status: Patch Available  (was: Open)

[~zhangduo]  sorry for the very long delay. I have finally submitted a patch.

I also have some idea about why this was such a big problem. This [patch in 
netty|https://github.com/netty/netty/pull/7468] causes huge amounts of 
allocations when processing large messages with SSL. The messages need to be 
broken down into 16kb chunks, so the messages passes through this 
copyAndCompose over and over as they're broken down. The pressure on the 
PoolArena is enough to slow down outbound buffer flushing, which also needs to 
release buffers as it does.

I created a fork and reverted that change, and the OOMs disappeared even 
without the protections in my patch. Of course even with that reverted, it's 
still possible to back up the outbound socket with enough load, so I think this 
protections are still worth it.

> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
>                 Key: HBASE-27947
>                 URL: https://issues.apache.org/jira/browse/HBASE-27947
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 2.6.0
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Critical
>
> We are rolling out the server side TLS settings to all of our QA clusters. 
> This has mostly gone fine, except on 1 cluster. Most clusters, including this 
> one have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This 
> cluster tends to get bursts of traffic, in which case it would typically jump 
> to 400-500mb. Again this is sampled, so it could have been higher than that. 
> When we enabled SSL on this cluster, we started seeing bursts up to at least 
> 4gb. This exceeded our {{{}-XX:MaxDirectMemorySize{}}}, which caused OOM's 
> and general chaos on the cluster.
>  
> We've gotten it under control a little bit by setting 
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and 
> {{{}-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible{}}}. 
> We've set netty's maxDirectMemory to be approx equal to 
> ({{{}-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize{}}}). Now we 
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain 
> for clients but at least insulates the other components of the regionserver.
>  
> We're still digging into exactly why this is happening. The cluster clearly 
> has a bad access pattern, but it doesn't seem like SSL should increase the 
> memory footprint by 5-10x like we're seeing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HBASE-27947) RegionServer OOM under load when TLS is enabled

Reply via email to