[ 
https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975711#comment-13975711
 ] 

Jimmy Xiang commented on HBASE-10871:
-------------------------------------

[~qwertymaniac], can you check whether the timeout monitor is enabled and, if 
so, how long the timeout is? Region assignments go through a separate RPC 
queue because they are high-priority calls. Have you changed the meta handler 
count? The default is 10, so, barring other issues, the server should be able 
to handle 10 open/close requests at the same time. Do we know how many regions 
were on that server?
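
To make the handler-count point concrete, here is a minimal toy model in plain 
Java. It is not HBase code: the class and method names are hypothetical, and a 
fixed thread pool merely stands in for the high-priority handlers. It only 
illustrates that a pool of N handlers can have at most N open/close calls in 
flight at once, so any further call waits for a free handler.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a pool of high-priority RPC handlers.
public class HandlerPoolSketch {

    /**
     * Submits `requests` simulated open/close calls to a pool of `handlers`
     * threads. Each call blocks until every call has started. Returns true
     * only if all calls were in flight at the same time.
     */
    static boolean allRanConcurrently(int handlers, int requests) {
        ExecutorService priorityPool = Executors.newFixedThreadPool(handlers);
        CountDownLatch allStarted = new CountDownLatch(requests);
        try {
            for (int i = 0; i < requests; i++) {
                priorityPool.execute(() -> {
                    allStarted.countDown();
                    try {
                        // Hold the handler until all calls are running, or
                        // until the pool is shut down and interrupts us.
                        allStarted.await(10, TimeUnit.SECONDS);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // If some call is stuck in the queue, the latch never reaches zero.
            return allStarted.await(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupts waiting handlers and discards queued calls.
            priorityPool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("10 handlers, 10 calls: " + allRanConcurrently(10, 10));
        System.out.println("10 handlers, 11 calls: " + allRanConcurrently(10, 11));
    }
}
```

In this model the 11th call never starts while the first 10 hold their 
handlers, which is why the number of regions on the server and any other 
factor that keeps handlers busy both matter here.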

> Indefinite OPEN/CLOSE wait on busy RegionServers
> ------------------------------------------------
>
>                 Key: HBASE-10871
>                 URL: https://issues.apache.org/jira/browse/HBASE-10871
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer, master, Region Assignment
>    Affects Versions: 0.94.6
>            Reporter: Harsh J
>            Assignee: Jimmy Xiang
>
> We observed a case where a specific RS was bombarded by a large number of 
> regular requests that spiked and filled up its RPC queue. The unassigns and 
> assigns that the balancer invoked for regions involving this server then 
> entered an indefinite retry loop.
> Specifically, the regions waited in PENDING_CLOSE/PENDING_OPEN states 
> indefinitely because the HBase client RPC from the ServerManager at the 
> master kept running into SocketTimeouts. This made the affected regions 
> unavailable. The timeout monitor retry default of 30m in 0.94's AM widened 
> the waiting gap further (this is now 10m in 0.95+'s new AM, which also 
> retries further before we get there, which is good).
> I wonder if there's a way to improve this situation generally. PENDING_OPENs 
> may be easy to handle: we can switch them out and move them elsewhere. 
> PENDING_CLOSEs may be trickier, but there should at least be a way to "give 
> up" permanently on a movement plan and let things be for a while, hoping the 
> RS recovers on its own (so that clients also have a chance of getting things 
> to work in the meantime).



--
This message was sent by Atlassian JIRA
(v6.2#6252)