[
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748937#comment-17748937
]
Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------
{quote}WDYT?
{quote}
Yes, that's true. I can try to wrap up the simpler solution for now.
{quote}Do you have any references?
{quote}
I think I was looking at their
[NoSizeEstimator|https://github.com/apache/cassandra/blob/5815f0d5eb43ce890dc3ea71f45a7488e5c6163a/src/java/org/apache/cassandra/net/NoSizeEstimator.java],
which I now realize is for outbound connections. In their
PipelineConfigurator, which I believe is the server configuration for client
requests, they just set a relatively low watermark of 8k - 32k. I don't see any
channelWritabilityChanged or isWritable checks, so I'm not sure why they
bother with a watermark. Searching git history, they used to have those checks
but lost them in a major rewrite a few years ago. I don't really know much about
the Cassandra code or its history, so I'm trying not to draw too many conclusions.
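
To illustrate why a watermark without isWritable checks does little on its own, here is a minimal sketch in plain Java. It is a simplified model, not Netty's or Cassandra's actual code: the class WatermarkModel and its methods are made up for illustration, and it assumes the usual semantics where the channel becomes unwritable once pending bytes exceed the high watermark and writable again once they drop below the low watermark.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical, simplified model of an outbound buffer with low/high watermarks. */
final class WatermarkModel {
    private final long lowWaterMark;
    private final long highWaterMark;
    private final Queue<Integer> pending = new ArrayDeque<>();
    private long pendingBytes = 0;
    private boolean writable = true;

    WatermarkModel(long lowWaterMark, long highWaterMark) {
        if (lowWaterMark < 0 || highWaterMark < lowWaterMark) {
            throw new IllegalArgumentException("need 0 <= low <= high");
        }
        this.lowWaterMark = lowWaterMark;
        this.highWaterMark = highWaterMark;
    }

    /** Queue a message. This always succeeds; the watermark only flips a flag. */
    void write(int sizeBytes) {
        pending.add(sizeBytes);
        pendingBytes += sizeBytes;
        if (pendingBytes > highWaterMark) {
            writable = false;
        }
    }

    /** Simulate the socket draining the oldest queued message. */
    void flushOne() {
        Integer size = pending.poll();
        if (size == null) {
            return;
        }
        pendingBytes -= size;
        if (pendingBytes < lowWaterMark) {
            writable = true;
        }
    }

    boolean isWritable() {
        return writable;
    }

    long pendingBytes() {
        return pendingBytes;
    }

    public static void main(String[] args) {
        // 8k - 32k watermark, matching the values mentioned above.
        WatermarkModel ignoring = new WatermarkModel(8 * 1024, 32 * 1024);
        for (int i = 0; i < 100; i++) {
            ignoring.write(4096); // never consults isWritable(), so the buffer keeps growing
        }
        System.out.println("ignoring writability, pending bytes: " + ignoring.pendingBytes());

        WatermarkModel checking = new WatermarkModel(8 * 1024, 32 * 1024);
        int accepted = 0;
        for (int i = 0; i < 100; i++) {
            if (checking.isWritable()) { // backpressure: stop producing once unwritable
                checking.write(4096);
                accepted++;
            }
        }
        System.out.println("checking writability, accepted " + accepted
                + " writes, pending bytes: " + checking.pendingBytes());
    }
}
```

In this model the watermark only bounds buffered bytes when the producer actually checks isWritable (or reacts to a writability-changed notification) and stops writing; otherwise the pending bytes grow without limit.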
> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
> Key: HBASE-27947
> URL: https://issues.apache.org/jira/browse/HBASE-27947
> Project: HBase
> Issue Type: Bug
> Components: rpc
> Affects Versions: 2.6.0
> Reporter: Bryan Beaudreault
> Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters.
> This has mostly gone fine, except on one cluster. Most clusters, including this
> one, have a sampled {{nettyDirectMemory}} usage of about 30-100 MB. This
> cluster tends to get bursts of traffic, in which case it would typically jump
> to 400-500 MB. Again, this is sampled, so it could have been higher than that.
> When we enabled SSL on this cluster, we started seeing bursts up to at least
> 4 GB. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs
> and general chaos on the cluster.
>
> We've gotten it under control a little bit by setting
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}.
> We've set netty's maxDirectMemory to be approximately equal to
> ({{-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize}}). Now we
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain
> for clients but at least insulates the other components of the regionserver.
>
> We're still digging into exactly why this is happening. The cluster clearly
> has a bad access pattern, but it doesn't seem like SSL should increase the
> memory footprint by 5-10x like we're seeing.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)