[ https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552322#comment-17552322 ]

Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 4:54 PM:
---------------------------------------------------------------------

[~zhangduo] Our current requirements would be _auth-conf_, but Viraj may have 
been testing with _auth_, which was the previous setting.

[~vjasani] I am curious whether, if you apply my patch and set 
hbase.netty.rpcserver.allocator=unpooled, the direct memory allocation still 
gets up to > 50 GB. My guess is yes: it is the concurrent demand for buffers 
under load that drives the usage, not excessive cache retention in the pooled 
allocator. Let's see if experimental results confirm the hypothesis. If 
switching to unpooled helps, then I am wrong and pooling configuration tweaks 
(read on below) should be considered. If I am correct, then we should 
investigate how to get direct IO buffers freed faster and/or apply limits or 
pacing to their allocation, possibly using a custom allocator.
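To make the experiment concrete, here is a minimal sketch of how such a 
setting could map onto Netty allocators, assuming the draft patch wires it 
through roughly like this; the pickAllocator helper and the "heap" option are 
hypothetical, while the shaded allocator classes themselves are real:
{code:java}
// A minimal, hypothetical sketch: mapping an "hbase.netty.rpcserver.allocator"
// value onto Netty allocators. The shaded allocator classes are real; the
// pickAllocator helper and the "heap" option are illustrative only.
import org.apache.hbase.thirdparty.io.netty.buffer.ByteBufAllocator;
import org.apache.hbase.thirdparty.io.netty.buffer.PooledByteBufAllocator;
import org.apache.hbase.thirdparty.io.netty.buffer.UnpooledByteBufAllocator;

public final class AllocatorChoice {
  static ByteBufAllocator pickAllocator(String type) {
    switch (type) {
      case "unpooled":
        // No arenas or thread-local caches; direct buffers are returned to the
        // system as soon as they are released.
        return UnpooledByteBufAllocator.DEFAULT;
      case "heap":
        // Pooled, but prefer heap buffers over direct memory.
        return new PooledByteBufAllocator(false);
      case "pooled":
      default:
        // Netty default: pooled, prefers direct memory when available.
        return PooledByteBufAllocator.DEFAULT;
    }
  }
}
{code}
If direct memory still climbs past 50 GB even with UnpooledByteBufAllocator, 
then pool cache retention can be ruled out as the cause.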

Looking at our PooledByteBufAllocator in hbase-thirdparty, it is clear that one 
issue people may be facing is confusion about system property names. I can see 
in the sources, via my IDE, that the shader rewrote the string constants 
containing the property keys too. Various resources on the Internet offer 
documentation and suggestions, but because we relocated Netty into thirdparty, 
the names have changed, so naively following the advice on StackOverflow and 
elsewhere will have no effect. The key here is the set of recommendations for 
when you want to prefer heap instead of direct memory.

Let me list them in terms of relevance for addressing this issue; a sketch of 
how to set the relocated names follows the lists below.

Highly relevant:
 - io.netty.allocator.cacheTrimInterval -> 
org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval
 -- This is the threshold number of allocations after which cached entries 
that are not frequently used will be freed. Lowering it from the default of 
8192 may reduce the overall amount of direct memory retained in steady state, 
because the evaluation will be performed more often, as often as you specify.
 - io.netty.noPreferDirect -> 
org.apache.hbase.thirdparty.io.netty.noPreferDirect
 -- If set to 'true', this prefers heap arena allocations regardless of what 
PlatformDependent would otherwise decide.
 - io.netty.allocator.numDirectArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas
 -- Various advice on the Internet suggests setting numDirectArenas=0 and 
noPreferDirect=true as the way to prefer heap-based buffers.

Less relevant:
 - io.netty.allocator.maxCachedBufferCapacity -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxCachedBufferCapacity
 -- This is the size-based retention policy for buffers; individual buffers 
larger than this will not be cached.
 - io.netty.allocator.numHeapArenas -> 
org.apache.hbase.thirdparty.io.netty.allocator.numHeapArenas
 - io.netty.allocator.pageSize -> 
org.apache.hbase.thirdparty.io.netty.allocator.pageSize
 - io.netty.allocator.maxOrder -> 
org.apache.hbase.thirdparty.io.netty.allocator.maxOrder
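
To make the renaming concrete, here is a small sketch that sets the relocated 
keys programmatically; in a real deployment you would pass the equivalent -D 
flags on the JVM command line (for example via HBASE_OPTS), and the values 
shown are examples for experimentation, not tuning advice:
{code:java}
// Illustrative only: the relocated property names must be visible before the
// shaded Netty classes initialize, so in practice they are passed as -D flags
// on the JVM command line. Values here are examples, not recommendations.
public final class ShadedNettyTunables {
  public static void main(String[] args) {
    System.setProperty("org.apache.hbase.thirdparty.io.netty.noPreferDirect", "true");
    System.setProperty("org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas", "0");
    System.setProperty("org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval", "1024");
    // Setting the original io.netty.* names has no effect against the relocated classes.
  }
}
{code}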

On [https://github.com/apache/hbase/pull/4505] I have a draft PR that allows 
the user to tweak the Netty bytebuf allocation policy. This may be a good idea 
to do in general. We may also want to support some of the above Netty tunables 
in HBase site configuration, as a way to eliminate confusion about them; our 
documentation would then describe the HBase site config property names.
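As a rough illustration of what such support could look like (the 
hbase.netty.allocator.cacheTrimInterval key and the bridge class below are 
hypothetical, not existing HBase configuration):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class NettyTunableBridge {
  // Hypothetical: copy an HBase site property onto the relocated Netty system
  // property. This must run before any shaded Netty allocator class is loaded.
  public static void applyCacheTrimInterval() {
    Configuration conf = HBaseConfiguration.create();
    String trim = conf.get("hbase.netty.allocator.cacheTrimInterval"); // hypothetical key
    if (trim != null) {
      System.setProperty(
        "org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval", trim);
    }
  }
}
{code}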

On a side note, we might spike on a TLS-based alternative to SASL RPC. I know 
this has been discussed, and even partially attempted, repeatedly over our 
history, but nonetheless the operational and performance issues with SASL 
remain.



> Netty "leak detected" and OutOfDirectMemoryError due to direct memory 
> buffering
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-26708
>                 URL: https://issues.apache.org/jira/browse/HBASE-26708
>             Project: HBase
>          Issue Type: Bug
>          Components: rpc
>    Affects Versions: 2.5.0, 2.4.6
>            Reporter: Viraj Jasani
>            Priority: Critical
>
> Under constant data ingestion, using default Netty based RpcServer and 
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused 
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - java:115)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3] 
> util.ResourceLeakDetector - 
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>   
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>   
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>   
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>   
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   java.lang.Thread.run(Thread.java:748)
>  {code}
> And finally handlers are removed from the pipeline due to 
> OutOfDirectMemoryError:
> {code:java}
> 2022-01-25 17:36:28,657 WARN  [S-EventLoopGroup-1-5] 
> channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it 
> reached at the tail of the pipeline. It usually means the last handler in the 
> pipeline did not handle the exception.
> org.apache.hbase.thirdparty.io.netty.channel.ChannelPipelineException: 
> org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.handlerAdded()
>  has thrown an exception; removed.
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:624)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:181)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:358)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:339)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcConnection.saslNegotiate(NettyRpcConnection.java:229)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$600(NettyRpcConnection.java:79)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcConnection$2.operationComplete(NettyRpcConnection.java:312)
>   at 
> org.apache.hadoop.hbase.ipc.NettyRpcConnection$2.operationComplete(NettyRpcConnection.java:300)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:605)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:653)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:691)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.hbase.thirdparty.io.netty.util.internal.OutOfDirectMemoryError: 
> failed to allocate 16777216 byte(s) of direct memory (used: 33269220801, max: 
> 33285996544)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:802)
>   at 
> org.apache.hbase.thirdparty.io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:632)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:607)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:202)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:172)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:134)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:395)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
>   at 
> org.apache.hbase.thirdparty.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
>   at 
> org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.writeResponse(NettyHBaseSaslRpcClientHandler.java:79)
>   at 
> org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.handlerAdded(NettyHBaseSaslRpcClientHandler.java:115)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.callHandlerAdded(AbstractChannelHandlerContext.java:938)
>   at 
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:609)
>   ... 24 more
>  {code}


