[ 
https://issues.apache.org/jira/browse/HBASE-28156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17775969#comment-17775969
 ] 

Duo Zhang commented on HBASE-28156:
-----------------------------------

{quote}
I think we should use a separate EventLoopGroup for the server parent 
(acceptor). I also think we should fix our HWT timer to schedule prior to the 
event loop.
{quote}

No, these are all by design.

We should always try to share the same EventLoopGroup within a process, to 
better share the same resources. If you find that we cannot saturate all the 
CPUs, just increase the thread count in the EventLoopGroup instead of 
introducing a new EventLoopGroup. And canceling a request is also designed to 
execute inside the channel handler, so there are no multi-threading problems, 
which simplifies the logic a lot.
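As a rough configuration sketch of that design (assuming Netty 4.x on the 
classpath; channel handlers, options, and shutdown handling are elided, and 
the thread count of 16 is only an example), the server and an in-process 
client can share one group like this:

```java
import io.netty.bootstrap.Bootstrap;
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.channel.socket.nio.NioSocketChannel;

public class SharedGroupConfig {
    public static void main(String[] args) {
        // One group for the whole process; raise the thread count here if
        // CPUs are under-utilized, rather than creating additional groups.
        EventLoopGroup shared = new NioEventLoopGroup(16);

        // The server uses the shared group for both the acceptor (parent)
        // and the connection workers (child).
        ServerBootstrap server = new ServerBootstrap()
                .group(shared, shared)
                .channel(NioServerSocketChannel.class);

        // An intra-process client shares the same threads as well.
        Bootstrap client = new Bootstrap()
                .group(shared)
                .channel(NioSocketChannel.class);

        shared.shutdownGracefully();
    }
}
```

This mirrors the single-group setup in NettyEventLoopGroupConfig that the 
issue below describes.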

> Intra-process client connections cause netty EventLoop deadlock
> ---------------------------------------------------------------
>
>                 Key: HBASE-28156
>                 URL: https://issues.apache.org/jira/browse/HBASE-28156
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> We've had a few operational incidents over the past few months where our 
> HMaster stops accepting new connections, but can continue processing requests 
> from existing ones. Finally I was able to get heap and thread dumps to 
> confirm what's happening.
> The core trigger is HBASE-24687, where the MobFileCleanerChore is not using 
> ClusterConnection. I've prodded the linked PR to get that resolved and will 
> take it over if I don't hear back soon.
> In this case, the chore is using the NettyRpcClient to make a local rpc call 
> to the same NettyRpcServer in the process. Due to 
> [NettyEventLoopGroupConfig|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/util/NettyEventLoopGroupConfig.java#L98],
>  we use the same EventLoopGroup for both the RPC Client and the RPC Server.
> What happens rarely is that the local client for MobFileCleanerChore gets 
> assigned to RS-EventLoopGroup-1-1. Since we share the EventLoopGroupConfig, 
> and [we don't specify a separate parent 
> group|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcServer.java#L155],
>  that group is also the group which processes new connections.
> What we see in this case is that RS-EventLoopGroup-1-1 gets hung in 
> Socket.accept. Since the client side is on the same EventLoop, its tasks get 
> stuck in a queue waiting for the executor. So the client can't send the 
> request that the server Socket is waiting for.
> Further, the client/chore gets stuck waiting on BlockingRpcCallback.get(). We 
> use an HWT TimerTask to cancel overdue requests, but it only gets scheduled 
> [once NettyRpcConnection.sendRequest0 is 
> executed|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcConnection.java#L371].
>  But sendRequest0 [executes on the 
> EventLoop|https://github.com/apache/hbase/blob/master/hbase-client/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcConnection.java#L393],
>  and thus gets similarly stuck. So we never schedule a timeout and the chore 
> gets stuck forever.
> While fixing HBASE-24687 will fix this case, I think we should improve our 
> netty configuration here so we can avoid problems like this if we ever do 
> intra-process RPC calls again (there may already be others, not sure).
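For illustration, the queuing mechanism the report describes can be reproduced 
with nothing but a single-threaded executor standing in for one netty 
EventLoop thread (a stdlib-only sketch; the class name, the 500 ms timeout, 
and the string results are invented for the demo):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class EventLoopDeadlock {
    // Returns "deadlock" if the nested task never gets to run.
    static String run() throws Exception {
        // A single-threaded executor stands in for one EventLoop thread.
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        try {
            Future<String> outer = eventLoop.submit(() -> {
                // The "request send" is queued on the same loop whose only
                // thread is currently busy running this very task.
                Future<String> response = eventLoop.submit(() -> "response");
                try {
                    // Blocking here means the queued task can never start.
                    return response.get(500, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    return "deadlock";
                }
            });
            return outer.get();
        } finally {
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run()); // prints "deadlock"
    }
}
```

The inner task (standing in for sendRequest0) can never run while the outer 
task blocks on its result, which is why the timeout never even gets scheduled 
in the scenario above.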



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
