[
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738235#comment-17738235
]
Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------
I suppose one problem with blocking the RPC handler is that a single slow client
(whose channel gets very backed up) could affect all clients, if a majority of
handlers end up stuck in this state. I will have to think about how to work
around this, but I'm also open to suggestions. One obvious possibility would be
to limit the time a handler is allowed to block, after which it closes the
channel as you said.
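
Roughly what I have in mind (just a sketch, not an actual patch; the latch wiring and the timeout value are hypothetical):

{code:java}
// Sketch: a handler that wants to write to a backed-up channel waits a bounded
// amount of time for the channel to become writable again; if the backlog does
// not drain by then, we close the channel instead of blocking forever.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import org.apache.hbase.thirdparty.io.netty.channel.Channel;

final class BoundedWritabilityWait {
  /**
   * @param writableLatch counted down by channelWritabilityChanged() when the
   *                      channel transitions back to writable (wiring not shown)
   */
  static void awaitWritableOrClose(Channel ch, CountDownLatch writableLatch,
      long timeoutMs) throws InterruptedException {
    if (ch.isWritable()) {
      return; // no backlog, nothing to wait for
    }
    if (!writableLatch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
      // Still backed up after the deadline: drop this client rather than let
      // the RPC handler stay stuck indefinitely.
      ch.close();
    }
  }
}
{code}
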
I think we need to find a balance: simply closing the connection whenever
there's a backlog would be very detrimental to throughput, since the client
would have to re-establish the connection and retry everything. On the other
hand, we can't allow a noisy neighbor to hold all the handlers.
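
For defining what counts as "backed up", netty's write-buffer water marks might be the right knob: once a connection's pending write bytes cross the high water mark, isWritable() flips to false, which is the signal the blocking/closing logic above could key off. A sketch, with made-up thresholds:

{code:java}
// Sketch: configure per-connection write-buffer water marks on the server
// bootstrap. The 1 MB / 4 MB thresholds are illustrative, not recommendations.
import org.apache.hbase.thirdparty.io.netty.bootstrap.ServerBootstrap;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelOption;
import org.apache.hbase.thirdparty.io.netty.channel.WriteBufferWaterMark;

final class WaterMarkConfig {
  static void apply(ServerBootstrap bootstrap) {
    // Below 1 MB pending: writable. Above 4 MB pending: isWritable() == false.
    bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(1024 * 1024, 4 * 1024 * 1024));
  }
}
{code}
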
> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
> Key: HBASE-27947
> URL: https://issues.apache.org/jira/browse/HBASE-27947
> Project: HBase
> Issue Type: Bug
> Components: rpc
> Affects Versions: 2.6.0
> Reporter: Bryan Beaudreault
> Priority: Critical
>
> We are rolling out the server side TLS settings to all of our QA clusters.
> This has mostly gone fine, except on 1 cluster. Most clusters, including this
> one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This
> cluster tends to get bursts of traffic, in which case it would typically jump
> to 400-500mb. Again, this is sampled, so it could have been higher than that.
> When we enabled SSL on this cluster, we started seeing bursts up to at least
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs and
> general chaos on the cluster.
>
> We've gotten it under control a little bit by setting
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}.
> We've set netty's maxDirectMemory to be approximately equal to
> ({{-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize}}). Now we
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain
> for clients but at least insulates the other components of the regionserver.
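>
> For concreteness, a rough example of the budgeting arithmetic (the sizes here
> are made up, not our actual settings):
>
> {code:java}
> // Hypothetical sizes, purely to illustrate the budgeting; real values depend
> // on the cluster's heap/off-heap configuration.
> long maxDirectMemory = 8L * 1024 * 1024 * 1024; // -XX:MaxDirectMemorySize=8g
> long bucketCacheSize = 4L * 1024 * 1024 * 1024; // off-heap bucket cache
> long reservoirSize   = 1L * 1024 * 1024 * 1024; // ByteBuffAllocator reservoir
> // Whatever is left over is the cap handed to netty:
> long nettyMaxDirect = maxDirectMemory - bucketCacheSize - reservoirSize; // 3g
> // -Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory=3221225472
> {code}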
>
> We're still digging into exactly why this is happening. The cluster clearly
> has a bad access pattern, but it doesn't seem like SSL should increase the
> memory footprint by 5-10x like we're seeing.