[
https://issues.apache.org/jira/browse/HBASE-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17776197#comment-17776197
]
Bryan Beaudreault commented on HBASE-28156:
-------------------------------------------
Thanks for the input [~zhangduo]. I've uploaded a thread dump. It's not super
useful; only two threads are of interest – {{RS-EventLoopGroup-1-1}}, which is
blocked in a native method, and
{{master/hmaster-foo-hb2-a-1-5c7944546c-vdzw4:60000.Chore.1}}, which is
blocked on a callback.
Worth noting that RS-EventLoopGroup-1-1 was blocked for over 12 hours without
doing anything. We enabled trace logging and that thread was silent. The Chore
thread was also blocked that entire time; no retries or timeouts occurred.
I also have a heap dump, which is what helped me identify that the two channels
were on the same EventLoop and that the client-side tasks were queued.
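To make the failure mode concrete, here is a minimal sketch of the starvation using plain {{java.util.concurrent}} in place of netty (class and message names are illustrative, not from the HBase code): the "server" work blocks the single loop thread, and the "client" task that would unblock it sits queued behind it on the same loop.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SharedLoopDeadlock {
    public static void main(String[] args) throws Exception {
        // Stand-in for the single shared netty EventLoop (RS-EventLoopGroup-1-1).
        ExecutorService loop = Executors.newSingleThreadExecutor();
        CompletableFuture<String> request = new CompletableFuture<>();

        // "Server" side: occupies the only loop thread waiting for a request,
        // analogous to the thread hung in a native accept/read.
        loop.submit(() -> {
            try {
                request.get(); // blocks the only loop thread
            } catch (Exception ignored) {
            }
        });

        // "Client" side: would deliver the request, but it is queued on the
        // same loop, behind the blocked server task, so it never runs.
        loop.submit(() -> request.complete("rpc"));

        // Observe the hang from outside with a timeout.
        try {
            request.get(500, TimeUnit.MILLISECONDS);
            System.out.println("completed");
        } catch (TimeoutException e) {
            System.out.println("deadlocked: client task queued behind blocked loop");
        }
        loop.shutdownNow();
    }
}
```

With two loop threads (or two groups) the second task runs and the future completes; with one shared thread it deadlocks, which matches the 12-hour silent hang observed here.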
> Intra-process client connections cause netty EventLoop deadlock
> ---------------------------------------------------------------
>
> Key: HBASE-28156
> URL: https://issues.apache.org/jira/browse/HBASE-28156
> Project: HBase
> Issue Type: Bug
> Reporter: Bryan Beaudreault
> Priority: Major
> Attachments: hmaster-foo-hb2-a-1-5c7944546c-vdzw4.threads.5
>
>
> We've had a few operational incidents over the past few months where our
> HMaster stops accepting new connections, but can continue processing requests
> from existing ones. Finally I was able to get heap and thread dumps to
> confirm what's happening.
> The core trigger is HBASE-24687, where the MobFileCleanerChore is not using
> ClusterConnection. I've prodded the linked PR to get that resolved and will
> take it over if I don't hear soon.
> In this case, the chore is using the NettyRpcClient to make a local rpc call
> to the NettyRpcServer in the same process. Due to
> [NettyEventLoopGroupConfig|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/NettyEventLoopGroupConfig.java#L98],
> we use the same EventLoopGroup for both the RPC Client and the RPC Server.
> Rarely, the local client for the MobFileCleanerChore gets assigned to
> RS-EventLoopGroup-1-1. Since we share the NettyEventLoopGroupConfig,
> and [we don't specify a separate parent
> group|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcServer.java#L155],
> that group is also the group which processes new connections.
> What we see in this case is that RS-EventLoopGroup-1-1 gets hung in
> Socket.accept. Since the client side is on the same EventLoop, its tasks get
> stuck in a queue waiting for the executor. So the client can't send the
> request that the server Socket is waiting for.
> Further, the client/chore gets stuck waiting on BlockingRpcCallback.get(). We
> use a HashedWheelTimer (HWT) TimerTask to cancel overdue requests, but it only gets scheduled
> [once NettyRpcConnection.sendRequest0 is
> executed|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcConnection.java#L371].
> But sendRequest0 [executes on the
> EventLoop|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcConnection.java#L393],
> and thus gets similarly stuck. So we never schedule a timeout and the chore
> gets stuck forever.
> While fixing HBASE-24687 will fix this case, I think we should improve our
> netty configuration here so we can avoid problems like this if we ever do
> intra-process RPC calls again (there may already be others, not sure).
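The secondary problem – the timeout never firing because it is only armed inside the send task that runs on the (blocked) event loop – can be sketched in stdlib terms too. This is an assumed mitigation, not the HBase code: arm the timeout on an independent timer thread (here a ScheduledExecutorService standing in for the HashedWheelTimer) *before* handing the send off to the loop, so the caller fails fast even if the loop never runs the task.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EagerTimeout {
    public static void main(String[] args) throws Exception {
        ExecutorService loop = Executors.newSingleThreadExecutor();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

        // Simulate the stuck loop: a task that never finishes occupies the thread.
        loop.submit(() -> {
            try {
                Thread.sleep(Long.MAX_VALUE);
            } catch (InterruptedException ignored) {
            }
        });

        CompletableFuture<String> call = new CompletableFuture<>();

        // Arm the timeout on a separate timer thread *before* handing the send
        // off to the loop, instead of inside the loop task (as sendRequest0 does).
        timer.schedule(
                () -> call.completeExceptionally(new TimeoutException("rpc timeout")),
                200, TimeUnit.MILLISECONDS);

        // The send never executes because the loop is blocked...
        loop.submit(() -> call.complete("response"));

        try {
            call.get();
        } catch (ExecutionException e) {
            // ...but the caller still fails fast instead of hanging forever.
            System.out.println("failed fast: " + e.getCause().getMessage());
        }
        loop.shutdownNow();
        timer.shutdownNow();
    }
}
```

With the ordering in the current code, neither the request nor its timeout ever gets scheduled, which is why the chore hangs with no retries or timeouts in the logs.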
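And for the proposed configuration improvement, the real fix would presumably be separate EventLoopGroups (or at least a dedicated boss/parent group) in NettyEventLoopGroupConfig / NettyRpcServer; the stdlib sketch below (illustrative names only) just shows why separation removes the starvation: the client's tasks can no longer queue behind the blocked server work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SeparateLoops {
    public static void main(String[] args) throws Exception {
        // Separate single-threaded "loops" for server and client, instead of
        // one shared group for both sides.
        ExecutorService serverLoop = Executors.newSingleThreadExecutor();
        ExecutorService clientLoop = Executors.newSingleThreadExecutor();

        CompletableFuture<String> request = new CompletableFuture<>();

        // Server blocks waiting for the request, tying up only its own loop.
        Future<String> served = serverLoop.submit(() -> "served " + request.get());

        // Client runs on its own loop, so it is never queued behind the server
        // task and the request actually goes out.
        clientLoop.submit(() -> request.complete("mob-cleaner rpc"));

        System.out.println(served.get(2, TimeUnit.SECONDS));
        serverLoop.shutdownNow();
        clientLoop.shutdownNow();
    }
}
```

Same two tasks as the deadlocking case, but because the client has its own executor the exchange completes immediately.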
--
This message was sent by Atlassian Jira
(v8.20.10#820010)