[
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17736251#comment-17736251
]
Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------
The cluster in question has a bunch of large (1mb+) rows, and there seems to be
a periodic job which causes large batches of multigets. The client in that job
has an rpc timeout of 60s and an operation timeout of 120s. We have our
server-side max result size set to 50mb, and have 80 handlers. During these
spikes, we see server-side latencies and queue times elevated into the seconds.
We also see a bunch of concurrent requests which end up reaching our max result
size.
A worst case of 50mb * 80 handlers is greater than our netty maxDirectMemory,
so it could theoretically be the problem. I tried lowering the max result size
to 25mb and still saw OutOfDirectMemoryErrors. It's also telling that we
previously never breached 1gb of direct memory for netty, and we are now going
over 4gb. There must be some new SSL-related allocations inflating the pool,
and I haven't figured out yet how to tune them.
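The worst-case arithmetic above can be sketched quickly; the numbers (80
handlers, 50mb max result size) are the ones from this comment, and the 4gb
netty budget is the observed ceiling, not a confirmed setting:

```shell
# Back-of-envelope worst case: every handler buffering a full-size
# response in direct memory at the same time.
handlers=80
max_result_mb=50
worst_case_mb=$((handlers * max_result_mb))
echo "worst case: ${worst_case_mb} MB"    # 4000 MB, roughly the 4gb we observed
```

Even at a 25mb max result size, 80 handlers can still demand ~2gb of direct
buffers before any SSL-related overhead, which may be why lowering the limit
alone didn't stop the errors.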
> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
> Key: HBASE-27947
> URL: https://issues.apache.org/jira/browse/HBASE-27947
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.6.0
> Reporter: Bryan Beaudreault
> Priority: Critical
>
> We are rolling out the server side TLS settings to all of our QA clusters.
> This has mostly gone fine, except on 1 cluster. Most clusters, including this
> one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This
> cluster tends to get bursts of traffic, in which case it would typically jump
> to 400-500mb. Again this is sampled, so it could have been higher than that.
> When we enabled SSL on this cluster, we started seeing bursts up to at least
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs
> and general chaos on the cluster.
>
> We've gotten it under control a little bit by setting
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}.
> We've set netty's maxDirectMemory to be approx equal to
> ({{-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize}}). Now we
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain
> for clients but at least insulates the other components of the regionserver.
>
> We're still digging into exactly why this is happening. The cluster clearly
> has a bad access pattern, but it doesn't seem like SSL should increase the
> memory footprint by 5-10x like we're seeing.
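The sizing formula from the description (netty's maxDirectMemory = total direct
memory minus the BucketCache and reservoir reservations) can be sketched as
below. The flag names are the ones quoted in the issue; every byte value is an
illustrative assumption, not the cluster's actual configuration:

```shell
# Illustrative sizing only; these three values are assumptions.
max_direct_mb=8192      # -XX:MaxDirectMemorySize
bucket_cache_mb=3072    # off-heap BucketCache reservation (assumed)
reservoir_mb=1024       # ByteBuffAllocator reservoir (assumed)

# Leave netty only what the other direct-memory consumers don't claim.
netty_budget_mb=$((max_direct_mb - bucket_cache_mb - reservoir_mb))
echo "netty budget: ${netty_budget_mb} MB"    # 4096 MB with these numbers

# Corresponding JVM flags (e.g. appended to HBASE_REGIONSERVER_OPTS);
# netty's property takes bytes, hence the conversion.
echo "-XX:MaxDirectMemorySize=${max_direct_mb}m \
-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory=$((netty_budget_mb * 1024 * 1024)) \
-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible=true"
```

With this cap in place, exhaustion surfaces as netty's own
OutOfDirectMemoryError rather than a JVM-wide OOM, which matches the
containment behavior described above.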
--
This message was sent by Atlassian Jira
(v8.20.10#820010)