[ https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738849#comment-17738849 ]
Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------

Sorry for all the updates, but I've had some success today with the non-blocking idea in my last comment. I did the simple thing and had handlers drop calls if the channel is not writable when they pull from the queue. I set the Netty high watermark to 5 MB and the low watermark to 512 KB. This still resolved the OOMs, but handler usage is better. I'm going to do more testing tomorrow before I package it up for a PR.

> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
>                 Key: HBASE-27947
>                 URL: https://issues.apache.org/jira/browse/HBASE-27947
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 2.6.0
>            Reporter: Bryan Beaudreault
>            Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters. This has mostly gone fine, except on one cluster. Most clusters, including this one, have a sampled {{nettyDirectMemory}} usage of about 30-100 MB. This cluster tends to get bursts of traffic, in which case it would typically jump to 400-500 MB. Again, this is sampled, so it could have been higher than that. When we enabled SSL on this cluster, we started seeing bursts up to at least 4 GB. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs and general chaos on the cluster.
>
> We've gotten it under control a little bit by setting {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}. We've set Netty's maxDirectMemory to be approximately equal to ({{-XX:MaxDirectMemorySize}} - BucketCacheSize - ReservoirSize). Now we are seeing Netty's own OutOfDirectMemoryError, which is still causing pain for clients but at least insulates the other components of the regionserver.
>
> We're still digging into exactly why this is happening.
> The cluster clearly has a bad access pattern, but it doesn't seem like SSL should increase the memory footprint by 5-10x like we're seeing.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
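The writability-based backpressure described in the comment can be sketched roughly as below. This is a hypothetical, self-contained model, not the actual HBase or Netty code: once the bytes pending on a connection exceed the high watermark, the channel is marked unwritable and handlers skip or defer its calls; it only becomes writable again after draining below the low watermark, giving hysteresis between the two thresholds (in real Netty this corresponds to `WriteBufferWaterMark` and `Channel.isWritable()`). All class and method names here are invented for illustration.

```java
// Hypothetical sketch of high/low watermark backpressure (not HBase code).
// Pending response bytes are tracked per channel; crossing the high
// watermark flips the channel to unwritable, and only draining below the
// low watermark flips it back, so writability doesn't flap at the boundary.
public class WatermarkBackpressure {
    private final long lowWaterMark;  // e.g. 512 KB, as in the comment above
    private final long highWaterMark; // e.g. 5 MB, as in the comment above
    private long pendingBytes = 0;
    private boolean writable = true;

    public WatermarkBackpressure(long lowWaterMark, long highWaterMark) {
        this.lowWaterMark = lowWaterMark;
        this.highWaterMark = highWaterMark;
    }

    /** Called when a response is queued for write on this channel. */
    public void bytesQueued(long n) {
        pendingBytes += n;
        if (pendingBytes > highWaterMark) {
            writable = false; // handlers should now skip/defer this channel's calls
        }
    }

    /** Called when the socket flushes bytes out to the client. */
    public void bytesFlushed(long n) {
        pendingBytes -= n;
        if (pendingBytes < lowWaterMark) {
            writable = true;
        }
    }

    /** Handlers consult this before processing a call pulled from the queue. */
    public boolean isWritable() {
        return writable;
    }

    public static void main(String[] args) {
        WatermarkBackpressure ch =
            new WatermarkBackpressure(512 * 1024, 5 * 1024 * 1024);
        ch.bytesQueued(6 * 1024 * 1024);               // burst past the 5 MB high watermark
        System.out.println(ch.isWritable());           // unwritable: skip/defer new calls
        ch.bytesFlushed(6 * 1024 * 1024 - 100 * 1024); // drain to 100 KB, below 512 KB
        System.out.println(ch.isWritable());           // writable again
    }
}
```

The point of the two thresholds is that handlers stop filling the outbound buffer for a slow client well before Netty has to allocate gigabytes of direct memory, which is the failure mode described in the issue.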