[ https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14017368#comment-14017368 ]
Esteban Gutierrez commented on HBASE-10871:
-------------------------------------------
[~jxiang] I ran into the same issue recently. Can we just let the master retry
the assignment in case of a {{java.net.SocketTimeoutException}}, i.e. just
remove the {{return}} marked below?
{code}
if (t instanceof java.net.SocketTimeoutException
    && this.serverManager.isServerOnline(plan.getDestination())) {
  LOG.warn("Call openRegion() to " + plan.getDestination()
      + " has timed out when trying to assign "
      + region.getRegionNameAsString()
      + ", but the region might already be opened on "
      + plan.getDestination() + ".", t);
  // return; <===
}
{code}
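
To make the idea concrete, here is a minimal standalone sketch of a bounded retry that reacts only to {{java.net.SocketTimeoutException}}. It is not the AssignmentManager code: the class {{AssignRetrySketch}}, the helper {{retryOnTimeout}} and the {{maxAttempts}} parameter are all hypothetical names used for illustration. It also assumes that repeating the call is harmless when the first attempt actually succeeded on the server side, which the warning above ("the region might already be opened") suggests would need checking.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

public class AssignRetrySketch {

    /**
     * Hypothetical helper (not an HBase API): invokes the call and retries
     * only when it fails with SocketTimeoutException, up to maxAttempts.
     * Any other exception propagates immediately.
     */
    static <T> T retryOnTimeout(Callable<T> rpc, int maxAttempts) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return rpc.call();
            } catch (SocketTimeoutException e) {
                // Remember the timeout and fall through to the next attempt
                // instead of returning, so the caller is never left waiting.
                last = e;
            }
        }
        // All attempts timed out: surface the last timeout to the caller.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // Simulated RPC: times out twice, then succeeds.
        String state = retryOnTimeout(() -> {
            if (++calls[0] < 3) {
                throw new SocketTimeoutException("simulated timeout");
            }
            return "OPENED";
        }, 5);
        System.out.println(state + " after " + calls[0] + " attempts");
    }
}
```

Bounding the attempts means a persistently busy server eventually produces an error the caller can act on (for example by picking another destination), rather than an indefinite wait.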
> Indefinite OPEN/CLOSE wait on busy RegionServers
> ------------------------------------------------
>
> Key: HBASE-10871
> URL: https://issues.apache.org/jira/browse/HBASE-10871
> Project: HBase
> Issue Type: Improvement
> Components: Balancer, master, Region Assignment
> Affects Versions: 0.94.6
> Reporter: Harsh J
>
> We observed a case where a specific RS got bombarded by a large number of
> regular requests, spiking and filling up its RPC queue. As a result, the
> unassigns and assigns invoked by the balancer for regions that dealt with
> this server entered an indefinite retry loop.
> The regions specifically began waiting in PENDING_CLOSE/PENDING_OPEN states
> indefinitely because the HBase client RPC from the ServerManager at the
> master was running into SocketTimeouts. This made the affected regions on
> that server unavailable. The timeout monitor retry default of 30m in 0.94's
> AM widened the waiting gap further (this is now 10m in 0.95+'s new AM, and
> has further retries before we get there, which is good).
> I wonder if there's a way to improve this situation generally. PENDING_OPENs
> may be easy to handle: we can switch them out and move them elsewhere.
> PENDING_CLOSEs may be a bit more tricky, but there should at least be a way
> to "give up" permanently on a movement plan and let things be for a while,
> hoping for the RS to recover on its own (so that clients also have a chance
> of getting things to work in the meantime)?