[
https://issues.apache.org/jira/browse/HBASE-26708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17552322#comment-17552322
]
Andrew Kyle Purtell edited comment on HBASE-26708 at 6/9/22 4:54 PM:
---------------------------------------------------------------------
[~zhangduo] Our current requirement would be _auth-conf_, but Viraj may have
been testing with _auth_, which was the previous setting.
[~vjasani] I am curious whether, if you apply my patch and set
hbase.netty.rpcserver.allocator=unpooled, the direct memory allocation still
gets up to > 50 GB. My guess is yes: it is the concurrent demand for buffers
under load that drives the usage, not excessive cache retention in the pooled
allocator. Let's see if experimental results confirm the hypothesis. If it
helps, then I am wrong and pooling configuration tweaks -- read on below --
should be considered. If I am correct, then we should investigate how to get
direct IO buffers freed faster and/or apply limits or pacing to their
allocation, possibly using a custom allocator.
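To make that comparison concrete, here is a minimal sketch of how the pooled
allocator's direct memory retention could be observed from inside the process,
for example from a test or a debug hook. It assumes the server is using
PooledByteBufAllocator.DEFAULT from the shaded artifact, which may not match
every deployment; treat it as illustration only.
{code:java}
import org.apache.hbase.thirdparty.io.netty.buffer.PooledByteBufAllocator;
import org.apache.hbase.thirdparty.io.netty.buffer.PooledByteBufAllocatorMetric;

public final class PooledAllocatorProbe {
  // Dump a few allocator metrics to help distinguish memory retained by the
  // pool's arenas and thread-local caches from memory demanded by concurrent
  // buffer allocations.
  public static void dump() {
    PooledByteBufAllocatorMetric m = PooledByteBufAllocator.DEFAULT.metric();
    System.out.println("direct arenas:       " + m.numDirectArenas());
    System.out.println("used direct memory:  " + m.usedDirectMemory() + " bytes");
    System.out.println("used heap memory:    " + m.usedHeapMemory() + " bytes");
    System.out.println("thread-local caches: " + m.numThreadLocalCaches());
    System.out.println("chunk size:          " + m.chunkSize() + " bytes");
  }
}
{code}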
Looking at our PooledByteBufAllocator in hbase-thirdparty, it is clear that one
issue people may be facing is confusion about system property names. I can see
in the sources, via my IDE, that the shader rewrote the string constants
containing the property keys too. Various resources on the Internet offer
documentation and suggestions, but because we relocated Netty into thirdparty,
the names have changed, so naively following the advice on StackOverflow and
other places will have no effect. The key recommendations here are the ones for
preferring heap instead of direct memory.
Let me list them in terms of relevance for addressing this issue; a sketch of
how the relocated names could be set follows the list.
Highly relevant:
- io.netty.allocator.cacheTrimInterval ->
org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval
-- This is the threshold number of allocations after which cached entries that
are not frequently used will be freed. Lowering it from the default of 8192 may
reduce the overall amount of direct memory retained in steady state, because
the evaluation will be performed more often, as often as you specify.
- io.netty.noPreferDirect ->
org.apache.hbase.thirdparty.io.netty.noPreferDirect
-- If set to 'true', this prefers heap arena allocations regardless of
PlatformDependent's own notion of preference.
- io.netty.allocator.numDirectArenas ->
org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas
-- Various advice on the Internet suggests setting numDirectArenas=0 and
noPreferDirect=true as the way to prefer heap-based buffers.
Less relevant:
- io.netty.allocator.maxCachedBufferCapacity ->
org.apache.hbase.thirdparty.io.netty.allocator.maxCachedBufferCapacity
-- This is the size-based retention policy for buffers; individual buffers
larger than this will not be cached.
- io.netty.allocator.numHeapArenas ->
org.apache.hbase.thirdparty.io.netty.allocator.numHeapArenas
- io.netty.allocator.pageSize ->
org.apache.hbase.thirdparty.io.netty.allocator.pageSize
- io.netty.allocator.maxOrder ->
org.apache.hbase.thirdparty.io.netty.allocator.maxOrder
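As a concrete illustration of the relocated names, here is a minimal sketch
that sets the heap-preference properties programmatically. The class name and
the chosen cacheTrimInterval value are hypothetical, and in practice these
properties would more likely be passed as -D JVM arguments (e.g. via
HBASE_OPTS) so they take effect before any shaded Netty class loads.
{code:java}
public final class ShadedNettyTuning {
  // Hypothetical sketch: this must run before the shaded Netty classes are
  // initialized, otherwise the properties are read too late to matter.
  public static void preferHeapBuffers() {
    System.setProperty("org.apache.hbase.thirdparty.io.netty.noPreferDirect", "true");
    System.setProperty("org.apache.hbase.thirdparty.io.netty.allocator.numDirectArenas", "0");
    // Trim thread-local caches more often than the default of every 8192 allocations.
    System.setProperty("org.apache.hbase.thirdparty.io.netty.allocator.cacheTrimInterval", "1024");
  }
}
{code}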
On [https://github.com/apache/hbase/pull/4505] I have a draft PR that allows
the user to tweak the Netty bytebuf allocation policy. This may be a good idea
in general. We may also want to support some of the above Netty tunables in
HBase site configuration, as a way to eliminate confusion about them; our
documentation would then describe the HBase site config property names, as
sketched below.
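For illustration, here is a minimal sketch of how such a site-level knob might
be consumed. The key hbase.netty.rpcserver.allocator is the one named above;
the default and the accepted values in the comment are assumptions, not
necessarily the draft PR's final contract.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public final class AllocatorPolicyExample {
  // Resolve the allocator policy from hbase-site.xml, falling back to pooled.
  public static String resolvePolicy() {
    Configuration conf = HBaseConfiguration.create();
    // e.g. "pooled", "unpooled", or a fully qualified ByteBufAllocator class name
    return conf.get("hbase.netty.rpcserver.allocator", "pooled");
  }
}
{code}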
On a side note, we might spike on an alternative to SASL RPC that is a
TLS-based implementation instead. I know this has been discussed, and even
partially attempted, repeatedly over our history, but nonetheless the
operational and performance issues with SASL remain.
> Netty "leak detected" and OutOfDirectMemoryError due to direct memory
> buffering
> -------------------------------------------------------------------------------
>
> Key: HBASE-26708
> URL: https://issues.apache.org/jira/browse/HBASE-26708
> Project: HBase
> Issue Type: Bug
> Components: rpc
> Affects Versions: 2.5.0, 2.4.6
> Reporter: Viraj Jasani
> Priority: Critical
>
> Under constant data ingestion, using default Netty based RpcServer and
> RpcClient implementation results in OutOfDirectMemoryError, supposedly caused
> by leaks detected by Netty's LeakDetector.
> {code:java}
> 2022-01-25 17:03:10,084 ERROR [S-EventLoopGroup-1-3]
> util.ResourceLeakDetector - java:115)
>
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.expandCumulation(ByteToMessageDecoder.java:538)
>
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteToMessageDecoder.java:97)
>
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:274)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> java.lang.Thread.run(Thread.java:748)
> {code}
> {code:java}
> 2022-01-25 17:03:14,014 ERROR [S-EventLoopGroup-1-3]
> util.ResourceLeakDetector -
> apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
>
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446)
>
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
>
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
>
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
>
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> java.lang.Thread.run(Thread.java:748)
> {code}
> And finally handlers are removed from the pipeline due to
> OutOfDirectMemoryError:
> {code:java}
> 2022-01-25 17:36:28,657 WARN [S-EventLoopGroup-1-5]
> channel.DefaultChannelPipeline - An exceptionCaught() event was fired, and it
> reached at the tail of the pipeline. It usually means the last handler in the
> pipeline did not handle the exception.
> org.apache.hbase.thirdparty.io.netty.channel.ChannelPipelineException:
> org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.handlerAdded()
> has thrown an exception; removed.
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:624)
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:181)
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:358)
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:339)
> at
> org.apache.hadoop.hbase.ipc.NettyRpcConnection.saslNegotiate(NettyRpcConnection.java:229)
> at
> org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$600(NettyRpcConnection.java:79)
> at
> org.apache.hadoop.hbase.ipc.NettyRpcConnection$2.operationComplete(NettyRpcConnection.java:312)
> at
> org.apache.hadoop.hbase.ipc.NettyRpcConnection$2.operationComplete(NettyRpcConnection.java:300)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:605)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
> at
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:653)
> at
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:691)
> at
> org.apache.hbase.thirdparty.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567)
> at
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
> at
> org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> at
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> Caused by:
> org.apache.hbase.thirdparty.io.netty.util.internal.OutOfDirectMemoryError:
> failed to allocate 16777216 byte(s) of direct memory (used: 33269220801, max:
> 33285996544)
> at
> org.apache.hbase.thirdparty.io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:802)
> at
> org.apache.hbase.thirdparty.io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:632)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:607)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:202)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:172)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:134)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:395)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
> at
> org.apache.hbase.thirdparty.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
> at
> org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.writeResponse(NettyHBaseSaslRpcClientHandler.java:79)
> at
> org.apache.hadoop.hbase.security.NettyHBaseSaslRpcClientHandler.handlerAdded(NettyHBaseSaslRpcClientHandler.java:115)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.callHandlerAdded(AbstractChannelHandlerContext.java:938)
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:609)
> ... 24 more
> {code}