[
https://issues.apache.org/jira/browse/HBASE-10871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13970255#comment-13970255
]
Jean-Daniel Cryans commented on HBASE-10871:
--------------------------------------------
I was made privy to the original issue's details, and I don't think this part
is accurate:
bq. the unassigns and assigns the balancer invoked for regions that dealt with
this server entered an indefinite retry loop.
What I saw from the logs is that the master got a {{SocketTimeoutException}}
trying to send the openRegion() call to the region server, and fell into this
part of the code in {{AssignmentManager}}:
{code}
} else if (t instanceof java.net.SocketTimeoutException
    && this.serverManager.isServerOnline(plan.getDestination())) {
  LOG.warn("Call openRegion() to " + plan.getDestination()
    + " has timed out when trying to assign "
    + region.getRegionNameAsString()
    + ", but the region might already be opened on "
    + plan.getDestination() + ".", t);
  return;
}
{code}
At that point we don't know whether the region was opened or not, so we stop
keeping track of it. Unfortunately, the 0.94 code has a default of 30 minutes
for the RIT timeout, so it took that long to finally see this message in the
log: "Region has been OFFLINE for too long, reassigning regionx to a random
server". The looping isn't indefinite; it's just that a few regions were
hitting the socket timeout repeatedly. Some machines were in a bad state.
There was also something strange with the call queues: I didn't see any
mention of a connection reset, so it doesn't seem like openRegion() even made
it to the queue.
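As an aside, shortening that 30-minute window on 0.94 is just a config knob. A
minimal sketch, assuming the key I remember
({{hbase.master.assignment.timeoutmonitor.timeout}}, default 1800000 ms) is the
right one; in practice you'd set the same property in hbase-site.xml on the
master rather than programmatically:
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RitTimeoutTweak {
  // Hedged sketch: drop the RIT timeout from the 30-minute 0.94 default to 10 minutes.
  // The key name is from memory; verify it against the 0.94 AssignmentManager.
  public static Configuration tunedMasterConf() {
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.master.assignment.timeoutmonitor.timeout", 600000); // 10 min
    return conf;
  }
}
{code}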
So it behaved the way we expect it to, although it's really not ideal, and the
regions stayed in limbo in OFFLINE, PENDING_OPEN, or PENDING_CLOSE.
Intuitively, I'd say we just ask the server directly whether it has the region
or not, but the fact that we got a socket timeout in the first place kinda
tells us that the RS is currently unreachable. Maybe it's going to die, which
would be good since we'd know for sure the region is closed, but if the region
was in fact opened then it might already be serving requests.
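To make the "just ask the server" idea concrete, here's a rough sketch of what
such a probe could look like on the master side. {{AdminProbe}} and
{{hasRegionOnline()}} are illustrative stand-ins, not the actual 0.94 master/RS
RPC interfaces; the sketch also shows why the idea doesn't buy us much, since
the probe goes to the very server that just timed out on us:
{code}
import java.io.IOException;

// Illustrative stand-in for whatever RPC the master would use to ask an RS
// about a single region; NOT the real 0.94 interface.
interface AdminProbe {
  boolean hasRegionOnline(byte[] regionName) throws IOException;
}

class RegionProbeSketch {
  /**
   * Returns true only if the RS positively confirms it is serving the region.
   * Any failure to reach the RS leaves us exactly where we started: unknown.
   */
  static boolean confirmedOpenDespiteTimeout(AdminProbe rs, byte[] regionName) {
    try {
      return rs.hasRegionOnline(regionName);
    } catch (IOException e) {
      // Same overloaded/unreachable server that timed out on openRegion(),
      // so the probe most likely times out too and we learn nothing new.
      return false;
    }
  }
}
{code}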
I can imagine keeping a list of regions in that special state and trying to
contact the region server every X seconds (rough sketch below), but it seems
brittle. FWIW the current situation of relying on the timeout isn't great
either, since the region could be open on the RS but we time out anyway and
double-assign it.
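For the record, that periodic-retry idea would look roughly like this: keep
the in-doubt regions in a list and have a chore re-probe their destination
every X seconds until we get a definite answer. Everything here
({{InDoubtRegionChore}}, {{Probe}}, {{Resolver}}) is hypothetical plumbing, not
existing master code, and it still inherits the brittleness above:
{code}
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a "list of regions in that special state" that gets
// re-checked every X seconds; types and callbacks are illustrative only.
class InDoubtRegionChore {
  interface Probe {
    // Asks the given server whether it currently serves the region;
    // throws IOException if the server can't be reached.
    boolean isOnline(String server, String regionName) throws IOException;
  }
  interface Resolver {
    void regionConfirmedOpen(String regionName);
    void regionConfirmedClosed(String regionName);
  }

  private final Map<String, String> inDoubt = new ConcurrentHashMap<>(); // region -> server
  private final Probe probe;
  private final Resolver resolver;
  private final ScheduledExecutorService pool =
      Executors.newSingleThreadScheduledExecutor();

  InDoubtRegionChore(Probe probe, Resolver resolver, long periodSeconds) {
    this.probe = probe;
    this.resolver = resolver;
    pool.scheduleAtFixedRate(this::checkAll, periodSeconds, periodSeconds, TimeUnit.SECONDS);
  }

  void add(String regionName, String server) {
    inDoubt.put(regionName, server);
  }

  private void checkAll() {
    for (Map.Entry<String, String> e : inDoubt.entrySet()) {
      try {
        boolean online = probe.isOnline(e.getValue(), e.getKey());
        if (online) {
          resolver.regionConfirmedOpen(e.getKey());
        } else {
          resolver.regionConfirmedClosed(e.getKey());
        }
        inDoubt.remove(e.getKey()); // definite answer either way
      } catch (IOException ex) {
        // Still unreachable; leave the region in the list, try again next period.
      }
    }
  }
}
{code}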
Pinging [~jxiang] for more input since he loves that part of the code :)
> Indefinite OPEN/CLOSE wait on busy RegionServers
> ------------------------------------------------
>
> Key: HBASE-10871
> URL: https://issues.apache.org/jira/browse/HBASE-10871
> Project: HBase
> Issue Type: Improvement
> Components: Balancer, master, Region Assignment
> Affects Versions: 0.94.6
> Reporter: Harsh J
>
> We observed a case where, when a specific RS got bombarded by a large number
> of regular requests, spiking and filling up its RPC queue, the unassigns and
> assigns the balancer invoked for regions that dealt with this server entered
> an indefinite retry loop.
> The regions specifically began waiting in PENDING_CLOSE/PENDING_OPEN states
> indefinitely because the HBase client RPC from the ServerManager at the
> master was running into SocketTimeouts. This caused unavailability for the
> affected regions on that server. The timeout monitor retry default of 30m in
> 0.94's AM compounded the waiting gap further (this is now 10m in 0.95+'s new
> AM, which also has further retries before we get there, which is good).
> Wonder if there's a way to improve this situation generally. PENDING_OPENs
> may be easy to handle - we can switch them out and move them elsewhere.
> PENDING_CLOSEs may be a bit trickier, but there should perhaps at least be a
> way to "give up" permanently on a movement plan and let things be for a
> while, hoping for the RS to recover on its own (such that clients also have
> a chance of getting things to work in the meantime)?