CT.verifyRegionLocation isn't doing a very good check, can delay cluster
recovery
---------------------------------------------------------------------------------
Key: HBASE-3984
URL: https://issues.apache.org/jira/browse/HBASE-3984
Project: HBase
Issue Type: Bug
Affects Versions: 0.90.3
Reporter: Jean-Daniel Cryans
Priority: Blocker
Fix For: 0.90.4
After some extensive debugging in the thread [A sudden msg of
"java.io.IOException: Server not running,
aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured out that the
region servers weren't able to talk to the new .META. location because the old
one was still alive but on its way down after an OOME.
This shows up as exceptions like "Server not running" when trying to edit
.META. Digging in the code, I see that
CT.waitForMetaServerConnectionDefault -> waitForMeta ->
getMetaServerConnection(true) calls verifyRegionLocation since we force the
refresh. In this method we check if the RS is good by calling getRegionInfo,
which *does not* check whether the region server is in the process of stopping.
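To make the gap concrete, here is a minimal sketch of the two kinds of check. It does not use the real HBase classes: RegionServerStub, isStopping, verifyWeak, verifyStrict and stub are all hypothetical names made up for illustration, and the "strict" variant is only one possible direction, not necessarily the fix that will be committed.

```java
import java.io.IOException;

// Hypothetical stand-ins for illustration only; these are NOT the HBase interfaces.
class MetaLocationCheckSketch {

    /** Minimal model of a region server as seen by the location check. */
    interface RegionServerStub {
        /** Answers while the region is still online; says nothing about shutdown. */
        String getRegionInfo(String regionName) throws IOException;

        /** Assumed extra probe: true once the server has started stopping or aborting. */
        boolean isStopping() throws IOException;
    }

    /** The check as described in this report: passes whenever getRegionInfo answers. */
    static boolean verifyWeak(RegionServerStub server, String regionName) {
        try {
            return server.getRegionInfo(regionName) != null;
        } catch (IOException e) {
            return false;
        }
    }

    /** A stricter check that also rejects a server that is on its way down. */
    static boolean verifyStrict(RegionServerStub server, String regionName) {
        try {
            return !server.isStopping() && server.getRegionInfo(regionName) != null;
        } catch (IOException e) {
            return false;
        }
    }

    /** Builds a fake server that still hosts the region and is optionally stopping. */
    static RegionServerStub stub(final boolean stopping) {
        return new RegionServerStub() {
            public String getRegionInfo(String regionName) {
                return regionName; // region is still online, so the call succeeds
            }

            public boolean isStopping() {
                return stopping;
            }
        };
    }

    public static void main(String[] args) {
        RegionServerStub dying = stub(true);
        System.out.println("weak check on dying server:   " + verifyWeak(dying, ".META."));
        System.out.println("strict check on dying server: " + verifyStrict(dying, ".META."));
    }
}
```

With the weak check, a server that is stopping but still answers getRegionInfo is accepted as a valid .META. location, which is the behaviour described above.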
What this means is that a cluster can't recover from the failure of the RS
serving .META. until that RS has fully shut down, since every time an RS tries
to open a region (like right after the log splitting) or split one, it fails
editing .META.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira