[
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bryan Beaudreault updated HBASE-27947:
--------------------------------------
Fix Version/s: 2.6.0
3.0.0-beta-1
Release Note:
When a slow client cannot read responses from the server fast enough, the
server-side channel outbound buffer grows. In extreme cases this can eventually
lead to an OOM, so this change adds new configurations to protect against that:
- hbase.server.netty.writable.watermark.low
- hbase.server.netty.writable.watermark.high
- hbase.server.netty.writable.watermark.fatal
When the high watermark is exceeded, the server stops accepting new requests
from the client; once outbound bytes drop below the low watermark, it starts
accepting them again. This does not stop the server from processing
already-enqueued requests, so if those requests continue to grow the outbound
bytes beyond the fatal threshold, the connection is forcibly closed.
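As a sketch, the three watermarks could be set in hbase-site.xml like this. The byte values below are illustrative only, not the shipped defaults:

```xml
<!-- Illustrative values only; tune to your workload. -->
<property>
  <name>hbase.server.netty.writable.watermark.low</name>
  <value>1048576</value>   <!-- resume reading requests below 1 MB pending -->
</property>
<property>
  <name>hbase.server.netty.writable.watermark.high</name>
  <value>2097152</value>   <!-- stop reading new requests above 2 MB pending -->
</property>
<property>
  <name>hbase.server.netty.writable.watermark.fatal</name>
  <value>10485760</value>  <!-- forcibly close the connection above 10 MB -->
</property>
```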
Also added new metrics for monitoring this situation in bean
"Hadoop:service=HBase,name=RegionServer,sub=IPC":
- UnwritableTime_* - histogram of the time between when the high watermark
was exceeded and when outbound bytes eventually dropped back below the low
watermark.
- nettyTotalPendingOutboundBytes - the total number of bytes, across all
channels, waiting to be written to sockets.
- nettyMaxPendingOutboundBytes - the number of bytes waiting on the single
most backed-up channel.
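These gauges can be scraped from the RegionServer's /jmx servlet, which exposes JMX beans as JSON (typically at http://regionserver:16030/jmx?qry=Hadoop:service=HBase,name=RegionServer,sub=IPC). A minimal sketch of pulling the two pending-outbound gauges out of that payload; the sample JSON and its values are made up for illustration:

```python
import json

# Illustrative /jmx servlet response; real payloads carry many more attributes.
SAMPLE = """
{
  "beans": [
    {
      "name": "Hadoop:service=HBase,name=RegionServer,sub=IPC",
      "nettyTotalPendingOutboundBytes": 1048576,
      "nettyMaxPendingOutboundBytes": 524288
    }
  ]
}
"""

def pending_outbound(jmx_json: str) -> dict:
    """Pull the pending-outbound-bytes gauges from the IPC bean, if present."""
    doc = json.loads(jmx_json)
    for bean in doc.get("beans", []):
        if bean.get("name", "").endswith("sub=IPC"):
            return {
                "total": bean.get("nettyTotalPendingOutboundBytes"),
                "max": bean.get("nettyMaxPendingOutboundBytes"),
            }
    return {}

print(pending_outbound(SAMPLE))  # {'total': 1048576, 'max': 524288}
```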
Resolution: Fixed
Status: Resolved (was: Patch Available)
Thank you everyone for the input, and especially [~zhangduo] for detailed help
and review. Thanks to [~norman] for his extra feedback and review on the
upstream netty PR (to be integrated in follow-up task
https://issues.apache.org/jira/browse/HBASE-28029).
I pushed this to master, branch-2, and branch-3. Since it's not SSL-specific
it could go to older branches as well, but the diff is too complicated to
backport cleanly.
> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
> Key: HBASE-27947
> URL: https://issues.apache.org/jira/browse/HBASE-27947
> Project: HBase
> Issue Type: Bug
> Components: rpc
> Affects Versions: 2.6.0
> Reporter: Bryan Beaudreault
> Assignee: Bryan Beaudreault
> Priority: Critical
> Fix For: 2.6.0, 3.0.0-beta-1
>
> Attachments: ssl-disabled-flamegraph.html,
> ssl-enabled-flamegraph.html, ssl-enabled-optimized.html
>
>
> We are rolling out the server side TLS settings to all of our QA clusters.
> This has mostly gone fine, except on 1 cluster. Most clusters, including this
> one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This
> cluster tends to get bursts of traffic, during which it would typically jump
> to 400-500mb. Again, this is sampled, so it could have been higher than that.
> When we enabled SSL on this cluster, we started seeing bursts up to at least
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs
> and general chaos on the cluster.
>
> We've gotten it under control a little bit by setting
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}.
> We've set netty's maxDirectMemory to be approximately equal to
> ({{-XX:MaxDirectMemorySize}} - BucketCacheSize - ReservoirSize). Now we
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain
> for clients but at least insulates the other components of the regionserver.
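> The sizing arithmetic above might look like the following hbase-env.sh
> fragment. The sizes are purely illustrative, not a recommendation:

```shell
# Illustrative sizing only -- substitute your own direct-memory, bucket-cache,
# and reservoir numbers. 8g total direct memory, minus ~4g bucket cache and
# ~1g reservoir, leaves roughly 3g (3221225472 bytes) for netty's allocator.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:MaxDirectMemorySize=8g \
  -Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory=3221225472 \
  -Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible=true"
```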
>
> We're still digging into exactly why this is happening. The cluster clearly
> has a bad access pattern, but it doesn't seem like SSL should increase the
> memory footprint by 5-10x like we're seeing.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)