[ 
https://issues.apache.org/jira/browse/HBASE-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17738710#comment-17738710
 ] 

Bryan Beaudreault commented on HBASE-27947:
-------------------------------------------

I have been trying to think about how to solve this without blocking RPC 
handlers.

The problem with just relying on setAutoRead(false) is that it only pauses 
acceptance of new requests into the call queue. There will already be requests 
in progress in RPC handlers, and there could be even more requests queued in 
our call queue. Allowing those to publish their responses to the channel can 
still result in OOM.

To solve this without blocking RPC handlers, we might need to either clear or 
temporarily invalidate calls in the call queue that originate from that 
channel. We could possibly achieve this by having the ServerCall retain a 
reference to the originating ServerRpcConnection. When a handler pulls a call 
from the queue, it checks whether that call's connection.channel is writable. 
If not, it could re-enqueue the call, drop it, or maybe close the connection? 
Not sure yet; I'm open to any thoughts. The other question is what to do with 
calls that are already in progress when the channel is made unwritable. Do we 
need a size-limited per-channel responseQueue?
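
A rough sketch of the handler-side check (class and method names here are 
illustrative, not the real ServerCall/CallRunner/NettyServerRpcConnection 
APIs):

{code:java}
import java.util.concurrent.BlockingQueue;

import io.netty.channel.Channel;

/** Illustrative only; the real handlers pull CallRunners from the RpcScheduler's queues. */
class HandlerLoopSketch {
  /** Hypothetical view of a queued call that can reach back to its connection's channel. */
  interface QueuedCall {
    Channel originChannel();    // channel of the originating connection
    void dropAsOverloaded();    // fail the call back to the client
    void process();             // run the call and write the response
  }

  private final BlockingQueue<QueuedCall> callQueue;

  HandlerLoopSketch(BlockingQueue<QueuedCall> callQueue) {
    this.callQueue = callQueue;
  }

  void runOnce() throws InterruptedException {
    QueuedCall call = callQueue.take();
    if (!call.originChannel().isWritable()) {
      // Channel is over its write-buffer high water mark. Options discussed
      // above: re-enqueue, drop, or close the connection. Dropping shown here.
      call.dropAsOverloaded();
      return;
    }
    call.process();
  }
}
{code}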

> RegionServer OOM under load when TLS is enabled
> -----------------------------------------------
>
>                 Key: HBASE-27947
>                 URL: https://issues.apache.org/jira/browse/HBASE-27947
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 2.6.0
>            Reporter: Bryan Beaudreault
>            Priority: Critical
>
> We are rolling out the server-side TLS settings to all of our QA clusters. 
> This has mostly gone fine, except on one cluster. Most clusters, including 
> this one, have a sampled {{nettyDirectMemory}} usage of about 30-100mb. This 
> cluster tends to get bursts of traffic, in which case it typically jumps 
> to 400-500mb. Again, this is sampled, so it could have been higher than that. 
> When we enabled SSL on this cluster, we started seeing bursts up to at least 
> 4gb. This exceeded our {{-XX:MaxDirectMemorySize}}, which caused OOMs 
> and general chaos on the cluster.
>  
> We've gotten it under control a little bit by setting 
> {{-Dorg.apache.hbase.thirdparty.io.netty.maxDirectMemory}} and 
> {{-Dorg.apache.hbase.thirdparty.io.netty.tryReflectionSetAccessible}}. 
> We've set netty's maxDirectMemory to be approximately equal to 
> ({{-XX:MaxDirectMemorySize - BucketCacheSize - ReservoirSize}}). Now we 
> are seeing netty's own OutOfDirectMemoryError, which is still causing pain 
> for clients but at least insulates the other components of the regionserver.
>  
> We're still digging into exactly why this is happening. The cluster clearly 
> has a bad access pattern, but it doesn't seem like SSL should increase the 
> memory footprint by 5-10x like we're seeing.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
