Andrew Kyle Purtell created HBASE-27951:
-------------------------------------------
Summary: Use ADMIN_QOS in MasterRpcServices for regionServerReport
Key: HBASE-27951
URL: https://issues.apache.org/jira/browse/HBASE-27951
Project: HBase
Issue Type: Bug
Affects Versions: 2.4.10
Reporter: Andrew Kyle Purtell
Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
Analysis of a recent production incident is not yet complete, but an item of
note is an apparent deadlock. Imagine you are gracefully draining a
regionserver by way of a flurry of moveRegion requests. The handler for
moveRegion submits a TRSP and then waits on its future without a timeout.
Imagine that there are enough moveRegion requests to tie up the
normal-priority master RPC pool. Now imagine that all of those requests are
waiting on TRSPs pending on a regionserver that is concurrently bounced, or
perhaps fails.
The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE because the target
regionserver terminated before responding to the close requests, blocking the
moveRegion requests, blocking the RPC handlers. The regionserver restarts and
tries to check in, but cannot report to the master because regionServerReport
is a normal-priority admin RPC and there are no free normal-priority handlers
to handle it.
regionServerReport should be handled at ADMIN_QOS priority to avoid this case.
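
The starvation described above can be reproduced in miniature with plain
java.util.concurrent thread pools. What follows is only a sketch under stated
assumptions, not HBase code: the class name HandlerStarvationSketch, the method
reportHandled, the pool sizes, and the 500 ms wait are all hypothetical, and the
model assumes, as a simplification, that the report itself is what releases the
blocked handlers.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-alone model of the handler starvation; not HBase code.
public class HandlerStarvationSketch {

  /** Returns true if the simulated report is handled within the wait window. */
  static boolean reportHandled(boolean useSeparatePool) throws Exception {
    int handlers = 3; // assumed size of the "normal priority" handler pool
    ExecutorService normalPool = Executors.newFixedThreadPool(handlers);
    ExecutorService adminPool = Executors.newFixedThreadPool(1);

    // Stands in for the TRSP futures; released only once the report arrives.
    CountDownLatch procedureDone = new CountDownLatch(1);
    // Lets the caller wait until every normal handler is occupied.
    CountDownLatch allBlocked = new CountDownLatch(handlers);

    try {
      // Simulated moveRegion handlers: each waits without a timeout.
      for (int i = 0; i < handlers; i++) {
        normalPool.submit(() -> {
          allBlocked.countDown();
          procedureDone.await();
          return null;
        });
      }
      allBlocked.await();

      // Simulated regionServerReport: handling it unblocks the waiters.
      Runnable reportTask = procedureDone::countDown;
      ExecutorService pool = useSeparatePool ? adminPool : normalPool;
      Future<?> report = pool.submit(reportTask);

      try {
        report.get(500, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        // No free handler picked the report up: the starvation case.
        return false;
      }
    } finally {
      // Interrupt any handlers that are still blocked so the JVM can exit.
      normalPool.shutdownNow();
      adminPool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("shared pool:   report handled = " + reportHandled(false));
    System.out.println("separate pool: report handled = " + reportHandled(true));
  }
}
```

With the shared pool the report sits in the queue behind handlers that never
finish, so the call times out. With a separate pool it is handled promptly and
the waiters are released, which is the effect the priority change is meant to
have for regionServerReport.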
--
This message was sent by Atlassian Jira
(v8.20.10#820010)