[
https://issues.apache.org/jira/browse/HBASE-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775903#comment-17775903
]
Bryan Beaudreault commented on HBASE-28156:
-------------------------------------------
I think we should use a separate EventLoopGroup for the server parent
(acceptor), so that a blocked child/client EventLoop can never starve
accepts. I also think we should fix our HWT (HashedWheelTimer) timeout so it
is scheduled before we dispatch to the event loop. A server child task could
still get blocked at that point, but not the acceptor. Do we need a
server-side hard timeout as well?
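
A minimal sketch of the acceptor split, assuming netty's standard two-group
ServerBootstrap form (class name, port, and pipeline below are illustrative,
not the actual NettyRpcServer code):

  import io.netty.bootstrap.ServerBootstrap;
  import io.netty.channel.ChannelInitializer;
  import io.netty.channel.EventLoopGroup;
  import io.netty.channel.nio.NioEventLoopGroup;
  import io.netty.channel.socket.NioServerSocketChannel;
  import io.netty.channel.socket.SocketChannel;

  public class SeparateAcceptorSketch {
    public static void main(String[] args) throws Exception {
      // A dedicated single-thread parent group that does nothing but accept(),
      // so a wedged worker/client EventLoop can never block new connections.
      EventLoopGroup acceptorGroup = new NioEventLoopGroup(1);
      EventLoopGroup workerGroup = new NioEventLoopGroup();
      try {
        ServerBootstrap b = new ServerBootstrap()
            .group(acceptorGroup, workerGroup) // two-arg form: parent != child
            .channel(NioServerSocketChannel.class)
            .childHandler(new ChannelInitializer<SocketChannel>() {
              @Override
              protected void initChannel(SocketChannel ch) {
                // the RPC pipeline would be installed here
              }
            });
        b.bind(0).sync().channel().closeFuture().sync();
      } finally {
        workerGroup.shutdownGracefully();
        acceptorGroup.shutdownGracefully();
      }
    }
  }

With the one-thread parent group dedicated to accept(), an intra-process
client pinned to a busy child loop can still stall its own requests, but it
can no longer stop new connections from being accepted.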
> Intra-process client connections cause netty EventLoop deadlock
> ---------------------------------------------------------------
>
> Key: HBASE-28156
> URL: https://issues.apache.org/jira/browse/HBASE-28156
> Project: HBase
> Issue Type: Bug
> Reporter: Bryan Beaudreault
> Priority: Major
>
> We've had a few operational incidents over the past few months where our
> HMaster stops accepting new connections but can continue processing requests
> on existing ones. I was finally able to get heap and thread dumps to confirm
> what's happening.
> The core trigger is HBASE-24687, where the MobFileCleanerChore is not using
> ClusterConnection. I've prodded the linked PR to get that resolved and will
> take it over if I don't hear back soon.
> In this case, the chore is using the NettyRpcClient to make an RPC call to
> the NettyRpcServer in the same process. Due to
> [NettyEventLoopGroupConfig|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/NettyEventLoopGroupConfig.java#L98],
> we use the same EventLoopGroup for both the RPC client and the RPC server.
> What happens, rarely, is that the local client for the MobFileCleanerChore
> gets assigned to RS-EventLoopGroup-1-1. Since we share the EventLoopGroup,
> and [we don't specify a separate parent
> group|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcServer.java#L155],
> that same group also processes new connections.
> What we see in this case is that RS-EventLoopGroup-1-1 gets hung in
> Socket.accept. Since the client side is on the same EventLoop, its tasks get
> stuck in the queue waiting for the executor, so the client can never send
> the request that the server socket is waiting for (a minimal sketch of this
> mechanism follows below).
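>
> A self-contained sketch of that mechanism (not HBase code; the names are
> illustrative). It just shows that a task queued on a single-threaded netty
> EventLoop behind a blocked task can never run:
>
>   import io.netty.channel.DefaultEventLoop;
>   import io.netty.channel.EventLoop;
>   import java.util.concurrent.CountDownLatch;
>   import java.util.concurrent.TimeUnit;
>
>   public class SameLoopDeadlockSketch {
>     public static void main(String[] args) throws Exception {
>       EventLoop loop = new DefaultEventLoop(); // one thread, like RS-EventLoopGroup-1-1
>       CountDownLatch unblock = new CountDownLatch(1);
>       // Task A holds the loop thread, standing in for the stuck acceptor side.
>       loop.execute(() -> {
>         try {
>           unblock.await();
>         } catch (InterruptedException ignored) {
>         }
>       });
>       // Task B stands in for the client's sendRequest0: queued on the SAME
>       // loop, it can never run while task A occupies the only thread.
>       loop.execute(unblock::countDown);
>       if (!unblock.await(3, TimeUnit.SECONDS)) {
>         System.out.println("deadlock: B never ran, so A never unblocked");
>       }
>       unblock.countDown(); // release A so the demo can exit
>       loop.shutdownGracefully();
>     }
>   }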
> Further, the client/chore gets stuck waiting on BlockingRpcCallback.get().
> We use a HashedWheelTimer (HWT) TimerTask to cancel overdue requests, but it
> only gets scheduled [once NettyRpcConnection.sendRequest0 is
> executed|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcConnection.java#L371].
> But sendRequest0 [executes on the
> EventLoop|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcConnection.java#L393],
> so it gets similarly stuck. We therefore never schedule a timeout, and the
> chore hangs forever (see the ordering sketch below).
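>
> A standalone sketch of the proposed ordering fix (again not HBase code; the
> HashedWheelTimer and executor are local stand-ins): arm the timeout before
> dispatching to the event loop, so the caller fails fast even when the loop
> is wedged:
>
>   import io.netty.util.HashedWheelTimer;
>   import io.netty.util.concurrent.DefaultEventExecutor;
>   import java.util.concurrent.TimeUnit;
>
>   public class ScheduleTimeoutFirstSketch {
>     public static void main(String[] args) throws Exception {
>       HashedWheelTimer timer = new HashedWheelTimer();
>       DefaultEventExecutor loop = new DefaultEventExecutor(); // stand-in for the channel's EventLoop
>
>       // Wedge the loop for 5s, standing in for the hung RS-EventLoopGroup-1-1.
>       loop.execute(() -> {
>         try {
>           Thread.sleep(5000);
>         } catch (InterruptedException ignored) {
>         }
>       });
>
>       // Proposed ordering: schedule the HWT timeout BEFORE dispatching
>       // sendRequest0, so the caller is unblocked even though the loop
>       // cannot run the task yet.
>       timer.newTimeout(t -> System.out.println("timed out after 1s; caller unblocked"),
>           1, TimeUnit.SECONDS);
>       loop.execute(() -> System.out.println("sendRequest0 finally ran (too late)"));
>
>       Thread.sleep(6000);
>       timer.stop();
>       loop.shutdownGracefully();
>     }
>   }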
> While fixing HBASE-24687 will fix this case, I think we should improve our
> netty configuration so we can avoid problems like this if we ever make
> intra-process RPC calls again (there may already be other such call sites;
> I'm not sure).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)