[
https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell updated HBASE-27951:
----------------------------------------
Hadoop Flags: Reviewed
Resolution: Fixed
Status: Resolved (was: Patch Available)
> Use ADMIN_QOS in MasterRpcServices for regionserver operational dependencies
> ----------------------------------------------------------------------------
>
> Key: HBASE-27951
> URL: https://issues.apache.org/jira/browse/HBASE-27951
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.10
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1, 4.0.0-alpha-1
>
>
> Analysis of a recent production incident is not yet complete, but one item of
> note is an apparent deadlock. Imagine you are gracefully draining a
> regionserver by way of a flurry of moveRegion requests. The handler for
> moveRegion submits a TRSP (TransitRegionStateProcedure) and then waits on its
> future without a timeout.
> Imagine that there are enough moveRegion requests to tie up the
> normal priority master RPC pool. Now imagine that all of those requests are
> waiting on TRSPs pending on a regionserver that is concurrently bounced or
> fails. The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE
> because the target regionserver terminated before responding to the close
> requests. That blocks the moveRegion requests, which in turn block the RPC
> handlers. The
> regionserver restarts and tries to check in, but cannot report to the master
> because there are no free normal priority handlers to handle it. It seems
> incorrect to have the regionserver operational dependencies
> (regionServerStartup, regionServerReport, and reportFatalRSError) contending
> with normal priority requests. They should be made ADMIN_QOS priority to
> avoid this case.
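
The handler starvation described above can be reproduced in miniature. The
sketch below is not HBase code: the class HandlerPoolSketch, the method
reportGetsThrough, and the pool sizes are hypothetical names and values chosen
for illustration. It assumes that higher priority requests are served by
handlers separate from the normal priority ones, and models each group as a
plain java.util.concurrent thread pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical illustration only; none of these names exist in HBase. */
public class HandlerPoolSketch {

    /**
     * Returns true if a simulated regionserver report completes while every
     * normal priority handler is stuck in a simulated moveRegion call.
     */
    public static boolean reportGetsThrough(boolean useSeparateAdminPool) throws Exception {
        int normalHandlers = 2; // assumed pool size, kept tiny for the demo
        ExecutorService normalPool = Executors.newFixedThreadPool(normalHandlers);
        ExecutorService adminPool = Executors.newFixedThreadPool(1);
        // Stands in for the region close that never finishes.
        CountDownLatch regionsClosed = new CountDownLatch(1);
        CountDownLatch handlersBusy = new CountDownLatch(normalHandlers);
        try {
            // Occupy every normal handler with a request that waits without timeout.
            for (int i = 0; i < normalHandlers; i++) {
                normalPool.submit(() -> {
                    handlersBusy.countDown();
                    try {
                        regionsClosed.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            handlersBusy.await();

            // The restarted regionserver now tries to check in.
            ExecutorService target = useSeparateAdminPool ? adminPool : normalPool;
            Future<String> report = target.submit(() -> "report accepted");
            try {
                report.get(500, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // the report is queued behind the blocked handlers
            }
        } finally {
            regionsClosed.countDown(); // release the simulated handlers
            normalPool.shutdownNow();
            adminPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool:   " + reportGetsThrough(false));
        System.out.println("separate pool: " + reportGetsThrough(true));
    }
}
```

With a shared pool the report waits in the queue until the timeout expires.
With a separate pool it is handled even though all normal handlers are
blocked, which is the effect the proposed priority change is meant to achieve.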