[ https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell reassigned HBASE-27951:
-------------------------------------------
Assignee: Andrew Kyle Purtell
> Use ADMIN_QOS in MasterRpcServices for regionServerReport
> ---------------------------------------------------------
>
> Key: HBASE-27951
> URL: https://issues.apache.org/jira/browse/HBASE-27951
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.10
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Analysis of a recent production incident is not yet complete, but an item of
> note is an apparent deadlock. Imagine you are gracefully draining a
> regionserver by way of a flurry of moveRegion requests. The handler for
> moveRegion submits a TRSP (TransitRegionStateProcedure) and then waits on its
> future without a timeout. Imagine that there are enough moveRegion requests to
> tie up the normal priority master RPC pool. Now imagine that all of those
> requests are waiting on TRSPs pending on a regionserver that is concurrently
> bounced, or that fails. The TRSPs are blocked in
> REGION_STATE_TRANSITION_CLOSE because the target regionserver terminated
> before responding to the close requests. That blocks the moveRegion requests,
> which in turn block the RPC handlers. The regionserver restarts and tries to
> check in, but cannot report to the master because regionServerReport is a
> normal priority admin RPC and there are no free normal priority handlers to
> handle it.
> regionServerReport should be handled at ADMIN_QOS priority to avoid this case.
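
The starvation described above can be reproduced in miniature with plain
java.util.concurrent thread pools. This is only a sketch under stated
assumptions, not HBase code: the class name, the pool sizes, and the
"normal" and "admin" pools are hypothetical stand-ins for the master's RPC
handler pools, and a latch that is never released stands in for a TRSP future
that never completes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; none of these names exist in HBase.
final class HandlerStarvationSketch {

  /** Submits a trivial "report" task and says whether it finished in time. */
  static boolean reportCompletes(ExecutorService pool, long timeoutMs) throws Exception {
    Future<String> report = pool.submit(() -> "report accepted");
    try {
      report.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      report.cancel(true); // the task never got a handler thread
      return false;
    }
  }

  /** Returns { completedViaNormalPool, completedViaAdminPool }. */
  static boolean[] simulate() throws Exception {
    int handlers = 2; // assumed size of the "normal priority" pool
    ExecutorService normalPool = Executors.newFixedThreadPool(handlers);
    ExecutorService adminPool = Executors.newFixedThreadPool(1);
    // Stands in for the procedure future; not released while the server is down.
    CountDownLatch procedureDone = new CountDownLatch(1);
    CountDownLatch handlersBusy = new CountDownLatch(handlers);
    try {
      // Each "moveRegion" call occupies one handler and waits without a timeout.
      for (int i = 0; i < handlers; i++) {
        normalPool.submit(() -> {
          handlersBusy.countDown();
          try {
            procedureDone.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      handlersBusy.await(); // every normal handler is now blocked

      // The restarted server's report queues behind the blocked handlers.
      boolean viaNormal = reportCompletes(normalPool, 300);
      // The same report on a separate pool is served immediately.
      boolean viaAdmin = reportCompletes(adminPool, 2000);
      return new boolean[] { viaNormal, viaAdmin };
    } finally {
      procedureDone.countDown(); // unblock the handlers so the JVM can exit
      normalPool.shutdownNow();
      adminPool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    boolean[] result = simulate();
    System.out.println("report via shared normal pool completed: " + result[0]);
    System.out.println("report via separate admin pool completed: " + result[1]);
  }
}
```

Run as written, the report submitted to the saturated pool should time out
while the one submitted to the separate pool completes. That is the reasoning
behind moving regionServerReport off the normal priority handlers.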