[ 
https://issues.apache.org/jira/browse/HBASE-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Kyle Purtell updated HBASE-27951:
----------------------------------------
    Description: 
Analysis of a recent production incident is not yet complete but an item of 
note is an apparent deadlock. Imagine you are gracefully draining a 
regionserver by way of a flurry of moveRegion requests. The handler for 
moveRegion submits a TRSP and then waits on its future without a timeout. Imagine 
that there are a sufficient number of moveRegion requests to tie up the normal 
priority master RPC pool. Now imagine that all of those requests are waiting on 
TRSPs pending on a regionserver that is concurrently bounced, or that fails. 
The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE because the target 
regionserver terminated before responding to the close requests, blocking the 
moveRegion requests, blocking the RPC handlers. The regionserver restarts and 
tries to check in, but cannot report to the master because regionServerReport 
is a normal priority admin RPC and there are no free normal priority handlers 
to handle it. It does not seem correct to have regionServerStartup and 
regionServerReport, which are important, contending with normal priority 
requests.

They should be made ADMIN_QOS priority to avoid this case. 

  was:
Analysis of a recent production incident is not yet complete but an item of 
note is an apparent deadlock. Imagine you are gracefully draining a 
regionserver by way of a flurry of moveRegion requests. The handler for 
moveRegion submits a TRSP and then waits on its future without timeout. Imagine 
that there are sufficient number of moveRegion requests to tie up the normal 
priority master RPC pool. Now imagine that all of those requests are waiting on 
TRSPs pending on a regionserver that is concurrently bounced or maybe it fails. 
The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE because the target 
regionserver terminated before responding to the close requests, blocking the 
moveRegion requests, blocking the RPC handlers. The regionserver restarts and 
tries to check in, but cannot report to the master because regionServerReport 
is a normal priority admin RPC and there are no free normal priority handlers 
to handle it. It seems not correct to have regionServerReport, which is 
important, contending with normal priority requests.

regionServerReport should be made ADMIN_QOS to avoid this case. 


> Use ADMIN_QOS in MasterRpcServices for regionServerStartup and 
> regionServerReport
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-27951
>                 URL: https://issues.apache.org/jira/browse/HBASE-27951
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.4.10
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
>
>
> Analysis of a recent production incident is not yet complete but an item of 
> note is an apparent deadlock. Imagine you are gracefully draining a 
> regionserver by way of a flurry of moveRegion requests. The handler for 
> moveRegion submits a TRSP and then waits on its future without a timeout. 
> Imagine that there are a sufficient number of moveRegion requests to tie up the 
> normal priority master RPC pool. Now imagine that all of those requests are 
> waiting on TRSPs pending on a regionserver that is concurrently bounced, or 
> that fails. The TRSPs are blocked in REGION_STATE_TRANSITION_CLOSE 
> because the target regionserver terminated before responding to the close 
> requests, blocking the moveRegion requests, blocking the RPC handlers. The 
> regionserver restarts and tries to check in, but cannot report to the master 
> because regionServerReport is a normal priority admin RPC and there are no 
> free normal priority handlers to handle it. It does not seem correct to have 
> regionServerStartup and regionServerReport, which are important, contending 
> with normal priority requests.
> They should be made ADMIN_QOS priority to avoid this case. 
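
The starvation described above can be reproduced in miniature with plain
java.util.concurrent thread pools. The sketch below is not HBase code: the
class name, method name, and pool sizes are hypothetical, and the two fixed
thread pools only stand in for the normal priority and ADMIN_QOS handler
pools. Each blocked task plays the role of a moveRegion handler waiting on a
TRSP future that will not complete until the regionserver reports in.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for the master's RPC handler pools; not HBase code.
public class HandlerStarvationSketch {

    /**
     * Saturates the "normal priority" pool with handlers that block forever,
     * then submits a "report" call either to that same pool or to a separate
     * "admin" pool. Returns true if the report completes within the timeout.
     */
    public static boolean reportCompletes(boolean useSeparatePool) throws Exception {
        int normalHandlers = 2; // assumed pool size, chosen only for the demo
        ExecutorService normalPool = Executors.newFixedThreadPool(normalHandlers);
        ExecutorService adminPool = Executors.newFixedThreadPool(1);
        // Stands in for the close acknowledgement the dead regionserver never sends.
        CountDownLatch closeAcked = new CountDownLatch(1);
        CountDownLatch handlersBusy = new CountDownLatch(normalHandlers);
        try {
            // Each task models a moveRegion handler waiting without a timeout.
            for (int i = 0; i < normalHandlers; i++) {
                normalPool.submit(() -> {
                    handlersBusy.countDown();
                    try {
                        closeAcked.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            handlersBusy.await(); // every normal priority handler is now occupied

            // The restarted regionserver tries to report in.
            ExecutorService target = useSeparatePool ? adminPool : normalPool;
            Future<String> report = target.submit(() -> "reported");
            try {
                return "reported".equals(report.get(500, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                return false; // the report is stuck in the queue behind blocked handlers
            }
        } finally {
            closeAcked.countDown(); // release the blocked handlers so the JVM can exit
            normalPool.shutdownNow();
            adminPool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool:   report completes = " + reportCompletes(false));
        System.out.println("separate pool: report completes = " + reportCompletes(true));
    }
}
```

With a shared pool the report never gets a handler and the call times out;
with its own pool it completes immediately. This is the reason for moving
regionServerStartup and regionServerReport off the normal priority pool.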



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
