Andrew Kyle Purtell created HBASE-27951:
-------------------------------------------

             Summary: Use ADMIN_QOS in MasterRpcServices for regionServerReport
                 Key: HBASE-27951
                 URL: https://issues.apache.org/jira/browse/HBASE-27951
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.4.10
            Reporter: Andrew Kyle Purtell
             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1


Analysis of a recent production incident is not yet complete, but one item of 
note is an apparent deadlock. Imagine you are gracefully draining a 
regionserver by way of a flurry of moveRegion requests. The handler for 
moveRegion submits a TRSP (TransitRegionStateProcedure) and then waits on its 
future without a timeout. Imagine that there are enough moveRegion requests to 
tie up every handler in the normal priority master RPC pool. Now imagine that 
all of those requests are waiting on TRSPs pending on a regionserver that is 
concurrently bounced or fails. The TRSPs are blocked in 
REGION_STATE_TRANSITION_CLOSE because the target regionserver terminated 
before responding to the close requests. That blocks the moveRegion requests, 
which in turn block the RPC handlers. The regionserver restarts and tries to 
check in, but cannot report to the master because regionServerReport is a 
normal priority admin RPC and there are no free normal priority handlers to 
handle it. 

regionServerReport should be made ADMIN_QOS to avoid this case. 
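
The starvation described above can be reproduced in miniature with plain 
java.util.concurrent pools. This is only a sketch, not HBase code: the class 
name, the pool sizes, and the latch standing in for "the regionserver has 
checked in" are all hypothetical, and it assumes that ADMIN_QOS requests are 
served by handlers separate from the normal priority ones, as the proposal 
implies. 

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HandlerStarvationSketch {
    // Hypothetical size of the normal priority handler pool.
    static final int NORMAL_HANDLERS = 4;

    /**
     * Simulates the scenario and returns true if the stand-in for
     * regionServerReport was handled within 500 ms.
     */
    static boolean reportHandled(boolean useAdminPool) throws Exception {
        ExecutorService normalPool = Executors.newFixedThreadPool(NORMAL_HANDLERS);
        ExecutorService adminPool = Executors.newFixedThreadPool(1);
        // Released only when the restarted regionserver manages to report in.
        CountDownLatch serverCheckedIn = new CountDownLatch(1);
        try {
            // Stand-ins for moveRegion handlers: each occupies a normal
            // handler and waits without a timeout, like the wait on the TRSP.
            for (int i = 0; i < NORMAL_HANDLERS; i++) {
                normalPool.submit(() -> {
                    try {
                        serverCheckedIn.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Stand-in for regionServerReport: it unblocks the waiters,
            // but only if some handler is free to run it.
            ExecutorService reportPool = useAdminPool ? adminPool : normalPool;
            Future<?> report = reportPool.submit(() -> {
                serverCheckedIn.countDown();
            });
            try {
                report.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // the report is stuck behind the blocked handlers
            }
        } finally {
            normalPool.shutdownNow();
            adminPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // prints "report on normal pool handled: false"
        System.out.println("report on normal pool handled: " + reportHandled(false));
        // prints "report on admin pool handled: true"
        System.out.println("report on admin pool handled: " + reportHandled(true));
    }
}
```

With the report queued on the saturated pool, nothing can ever release the 
waiters. Routing it to a separate pool breaks the cycle. 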



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
