[ https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14030935#comment-14030935 ]

Hudson commented on HBASE-10871:
--------------------------------

SUCCESS: Integrated in HBase-0.98 #334 (See [https://builds.apache.org/job/HBase-0.98/334/])
HBASE-10871 Indefinite OPEN/CLOSE wait on busy RegionServers (Esteban) (jxiang: rev 7ffc454ccc64f095d8992f03edeb3aacd83de92e)
* hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java


> Indefinite OPEN/CLOSE wait on busy RegionServers
> ------------------------------------------------
>
>                 Key: HBASE-10871
>                 URL: https://issues.apache.org/jira/browse/HBASE-10871
>             Project: HBase
>          Issue Type: Improvement
>          Components: Balancer, master, Region Assignment
>    Affects Versions: 0.94.6
>            Reporter: Harsh J
>            Assignee: Esteban Gutierrez
>             Fix For: 0.99.0, 0.94.21, 0.98.4
>
>         Attachments: HBASE-10871-0.94.v1.patch, HBASE-10871.v0.patch, 
> HBASE-10871.v1.patch
>
>
> We observed a case where one RS was bombarded by a large volume of regular
> requests that spiked and filled up its RPC queue. The unassigns and assigns
> that the balancer invoked for regions involving this server then entered an
> indefinite retry loop.
> The regions began waiting in the PENDING_CLOSE/PENDING_OPEN states
> indefinitely because the HBase client RPC from the ServerManager at the
> master was running into SocketTimeouts. This made the affected regions
> unavailable on that server. The timeout monitor's default retry period of
> 30m in 0.94's AM lengthened the wait further (this is now 10m in 0.95+'s
> new AM, which also makes further retries before reaching that point, which
> is good).
> I wonder if there's a way to improve this situation generally. PENDING_OPENs
> may be easy to handle: we can switch them out and move them elsewhere.
> PENDING_CLOSEs may be a bit more tricky, but there should at least be a way
> to "give up" permanently on a movement plan and let things be for a while,
> hoping the RS recovers on its own (so that clients also have a chance of
> getting things to work in the meantime).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
