[
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17739130#comment-17739130
]
Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------
Thanks for the input [~binlijin]! I'm concerned about closing the connection
because it can be highly disruptive to the caller. It's also complicated with SSL,
because you can't just close an SSL connection: the underlying SSL handler tries to
flush a close_notify first, which is asynchronous and can take a while. I did a
quick test with closing the connection and still saw OOMs in my case. For those
reasons, I'm trying to find a middle ground where we drop requests first.
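For illustration, here is a minimal sketch (not the HBase code) of why the close is
not immediate when TLS is in play: with an SslHandler in the pipeline, close() only
completes after the handler has flushed the close_notify (or a flush timeout fires).
The class name and the timeout value below are placeholders for the example.
{code:java}
// Illustrative sketch only (not HBase code). With an SslHandler in the pipeline,
// close() completes asynchronously: the handler flushes a TLS close_notify first.
// The class name and the 1-second timeout below are placeholders.
import org.apache.hbase.thirdparty.io.netty.channel.Channel;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelFutureListener;
import org.apache.hbase.thirdparty.io.netty.handler.ssl.SslHandler;

final class TlsCloseSketch {
  static void closeWithTls(Channel channel) {
    SslHandler ssl = channel.pipeline().get(SslHandler.class);
    if (ssl != null) {
      // Bound how long the handler may spend flushing the close_notify record.
      ssl.setCloseNotifyFlushTimeoutMillis(1000L);
    }
    // The returned future completes only after close_notify has been flushed
    // (or the timeout fires), so the channel can look open for a while after this call.
    channel.close().addListener((ChannelFutureListener) f ->
        System.out.println("channel fully closed, success=" + f.isSuccess()));
  }
}
{code}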
Currently I am iterating on the following (rough sketch after the list):
* channelWritabilityChanged manages three states:
** Channel autoread
** An AtomicBoolean denoting the writable state
** A TimerTask which, unless cancelled by going back into writable state,
closes the connection after a configurable time period of unwritability.
* When an RPC handler executes CallRunner, we check if the channel is
writable. If not, drop the request.
* When we close the channel, we also set another AtomicBoolean. When
attempting to write a response to the pipeline, if the boolean is false, we
throw a ConnectionClosingException. We can't rely on channel.isOpen() because
SSL close_notify might delay this returning false.
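To make that concrete, a rough sketch of the handler side follows. This is
illustrative rather than the actual patch: the WritabilityGuard name, the field
names, and the Timer-based scheduling are placeholders of mine, and the callers
that drop requests or throw ConnectionClosingException are only described in the
javadoc comments.
{code:java}
// Rough sketch of the approach, not the actual patch. Class/field names, the
// Timer-based scheduling, and the ConnectionClosingException wiring in callers
// are placeholders for illustration.
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.hbase.thirdparty.io.netty.channel.Channel;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelHandlerContext;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelInboundHandlerAdapter;

class WritabilityGuard extends ChannelInboundHandlerAdapter {
  private final long closeAfterUnwritableMillis; // configurable grace period
  private final AtomicBoolean writable = new AtomicBoolean(true);
  private final AtomicBoolean open = new AtomicBoolean(true);
  private final Timer timer = new Timer(true);
  private TimerTask closeTask;

  WritabilityGuard(long closeAfterUnwritableMillis) {
    this.closeAfterUnwritableMillis = closeAfterUnwritableMillis;
  }

  @Override
  public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
    Channel ch = ctx.channel();
    boolean nowWritable = ch.isWritable();
    writable.set(nowWritable);
    // Stop reading from the socket while the outbound buffer is over the high watermark.
    ch.config().setAutoRead(nowWritable);
    if (nowWritable) {
      // Back below the low watermark: cancel the pending close.
      if (closeTask != null) {
        closeTask.cancel();
        closeTask = null;
      }
    } else {
      // If we stay unwritable for too long, give up and close the channel.
      closeTask = new TimerTask() {
        @Override
        public void run() {
          open.set(false);
          ch.close();
        }
      };
      timer.schedule(closeTask, closeAfterUnwritableMillis);
    }
    super.channelWritabilityChanged(ctx);
  }

  /** Checked by the RPC handler before running a CallRunner; if false, drop the request. */
  boolean isWritable() {
    return writable.get();
  }

  /**
   * Checked before writing a response to the pipeline; if false the caller throws
   * ConnectionClosingException. channel.isOpen() is not enough here because the SSL
   * close_notify can delay it returning false.
   */
  boolean isOpenForResponses() {
    return open.get();
  }
}
{code}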
I'm working through a set of configurations for the high and low watermarks and
the closeAfterUnwritable millis, to find a happy medium for my extreme test case.
Once I have that, I'll try it on a more normal load test case. Ideally anything I
do here will not affect normal-load performance at all.
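For reference, the watermark knobs map onto Netty's per-channel
WriteBufferWaterMark option; a minimal sketch of the wiring, with placeholder
values and the WritabilityGuard handler from the sketch above, looks like this:
{code:java}
// Illustrative wiring only. The option names are Netty's; the concrete values and
// the close-after-unwritable constant are placeholders.
import org.apache.hbase.thirdparty.io.netty.bootstrap.ServerBootstrap;
import org.apache.hbase.thirdparty.io.netty.channel.ChannelOption;
import org.apache.hbase.thirdparty.io.netty.channel.WriteBufferWaterMark;

final class WatermarkConfigSketch {
  // Grace period before an unwritable channel is closed (placeholder value).
  static final long CLOSE_AFTER_UNWRITABLE_MILLIS = 30_000L;

  static void configure(ServerBootstrap bootstrap) {
    // Writability flips to false once the outbound buffer exceeds the high watermark
    // and only recovers after it drains below the low watermark.
    bootstrap.childOption(ChannelOption.WRITE_BUFFER_WATER_MARK,
        new WriteBufferWaterMark(256 * 1024, 1024 * 1024)); // low=256 KB, high=1 MB (placeholders)
  }
}
{code}
The gap between the low and high watermarks controls how often writability, and
with it autoread, toggles, which is part of what I'm tuning.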
I will note that doing setAutoRead(false) drastically reduces overall
throughput for the bad caller. When we manage setAutoRead, the test runs about
50% fewer requests in the allotted time. This is because the caller spends a
lot more time in a situation where it can't even enqueue requests to the
server. This may be ok or even preferable, but I want to see how it plays out
in different scenarios.
> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
> Key: HBASE-27947
> URL: https://issues.apache.org/jira/browse/HBASE-27947
> Project: HBase
> Issue Type: Bug
> Components: rpc
> Affects Versions: 2.6.0
> Reporter: Bryan Beaudreault
> Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters.
> This has mostly gone fine, except on one cluster. Most clusters, including this
> one, have a sampled {{nettyDirectMemory}} usage of about 30-100 MB. This
> cluster tends to get bursts of traffic, in which case it would typically jump
> to 400-500 MB. Again, this is sampled, so it could have been higher than that.
> When we enabled SSL on this cluster, we started seeing bursts up to at least
> 4 GB. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs
> and general chaos on the cluster.
>
> We've gotten it under control a little bit by setting
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}.
> We've set netty's maxDirectMemory to be approx equal to
> ({{-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize}}). Now we
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain
> for clients but at least insulates the other components of the regionserver.
>
> We're still digging into exactly why this is happening. The cluster clearly
> has a bad access pattern, but it doesn't seem like SSL should increase the
> memory footprint by 5-10x like we're seeing.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)