[ https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Kyle Purtell updated HBASE-27951:
----------------------------------------
Description:
Analysis of a recent production incident is not yet complete, but one item of
note is an apparent deadlock. Imagine you are gracefully draining a
regionserver by way of a flurry of moveRegion requests. The handler for
moveRegion submits a TRSP and then waits on its future without a timeout.
Imagine that there are enough moveRegion requests to tie up the normal
priority master RPC pool. Now imagine that all of those requests are waiting on
TRSPs pending on a regionserver that is concurrently bounced, or that fails.
The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE because the target
regionserver terminated before responding to the close requests. This blocks
the moveRegion requests, which in turn block the RPC handlers. The regionserver
restarts and tries to check in, but cannot report to the master because there
are no free normal priority handlers to handle it. It does not seem correct to
have the regionserver operational dependencies (regionServerStartup,
regionServerReport, and reportFatalRSError) contending with normal priority
requests.
They should be made ADMIN_QOS priority to avoid this case.
was:
Analysis of a recent production incident is not yet complete, but one item of
note is an apparent deadlock. Imagine you are gracefully draining a
regionserver by way of a flurry of moveRegion requests. The handler for
moveRegion submits a TRSP and then waits on its future without a timeout.
Imagine that there are enough moveRegion requests to tie up the normal
priority master RPC pool. Now imagine that all of those requests are waiting on
TRSPs pending on a regionserver that is concurrently bounced, or that fails.
The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE because the target
regionserver terminated before responding to the close requests. This blocks
the moveRegion requests, which in turn block the RPC handlers. The regionserver
restarts and tries to check in, but cannot report to the master because
regionServerReport is a normal priority admin RPC and there are no free normal
priority handlers to handle it. It does not seem correct to have
regionServerStartup and regionServerReport, which are important, contending
with normal priority requests.
They should be made ADMIN_QOS priority to avoid this case.
> Use ADMIN_QOS in MasterRpcServices for regionserver operational dependencies
> ----------------------------------------------------------------------------
>
> Key: HBASE-27951
> URL: https://issues.apache.org/jira/browse/HBASE-27951
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.4.10
> Reporter: Andrew Kyle Purtell
> Assignee: Andrew Kyle Purtell
> Priority: Major
> Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Analysis of a recent production incident is not yet complete, but one item of
> note is an apparent deadlock. Imagine you are gracefully draining a
> regionserver by way of a flurry of moveRegion requests. The handler for
> moveRegion submits a TRSP and then waits on its future without a timeout.
> Imagine that there are enough moveRegion requests to tie up the normal
> priority master RPC pool. Now imagine that all of those requests are waiting
> on TRSPs pending on a regionserver that is concurrently bounced, or that
> fails. The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE because the
> target regionserver terminated before responding to the close requests. This
> blocks the moveRegion requests, which in turn block the RPC handlers. The
> regionserver restarts and tries to check in, but cannot report to the master
> because there are no free normal priority handlers to handle it. It does not
> seem correct to have the regionserver operational dependencies
> (regionServerStartup, regionServerReport, and reportFatalRSError) contending
> with normal priority requests.
> They should be made ADMIN_QOS priority to avoid this case.
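
The starvation described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not HBase code: the class name HandlerPoolSketch, the method reportSucceeds, the pool sizes, and the 500 ms timeout are all hypothetical, and plain java.util.concurrent thread pools stand in for the master's normal priority and ADMIN_QOS handler pools. The never-completed future stands in for TRSPs stuck in REGION_STATE_TRANSITION_CLOSE.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Toy model of the handler starvation described above; not HBase code. */
public class HandlerPoolSketch {

  /**
   * Saturates a small "normal priority" pool with handlers that wait on a
   * future that never completes, then submits a regionserver report.
   * Returns true if the report is handled within the timeout.
   */
  public static boolean reportSucceeds(boolean useAdminPool) throws Exception {
    int normalHandlers = 4; // hypothetical pool size
    ExecutorService normalPool = Executors.newFixedThreadPool(normalHandlers);
    ExecutorService adminPool = Executors.newFixedThreadPool(1);
    // Left incomplete while the report is attempted: models a close request
    // that the terminated regionserver never answers.
    CompletableFuture<Void> stuckProcedure = new CompletableFuture<>();
    try {
      for (int i = 0; i < normalHandlers; i++) {
        // Each simulated moveRegion handler blocks without a timeout.
        normalPool.submit(() -> stuckProcedure.join());
      }
      // The report is routed either to the saturated pool or to the separate one.
      ExecutorService target = useAdminPool ? adminPool : normalPool;
      Future<String> report = target.submit(() -> "report handled");
      try {
        report.get(500, TimeUnit.MILLISECONDS);
        return true;
      } catch (TimeoutException e) {
        return false; // no free handler picked up the report in time
      }
    } finally {
      // Release the blocked handlers so both pools can shut down.
      stuckProcedure.complete(null);
      normalPool.shutdownNow();
      adminPool.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception {
    System.out.println("report via normal pool handled: " + reportSucceeds(false));
    System.out.println("report via admin pool handled:  " + reportSucceeds(true));
  }
}
```

In this model the report sent to the saturated pool times out while the one sent to the separate pool is handled, which is the isolation the proposed priority change is meant to provide for regionServerStartup, regionServerReport, and reportFatalRSError.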