[jira] [Created] (HBASE-3984) CT.verifyRegionLocation isn't doing a very good check, can delay cluster recovery

Jean-Daniel Cryans (JIRA) Mon, 13 Jun 2011 11:37:54 -0700

CT.verifyRegionLocation isn't doing a very good check, can delay cluster 
recovery
---------------------------------------------------------------------------------


                 Key: HBASE-3984
                 URL: https://issues.apache.org/jira/browse/HBASE-3984
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.90.3
            Reporter: Jean-Daniel Cryans
            Priority: Blocker
             Fix For: 0.90.4


After some extensive debugging in the thread [A sudden msg of 
"java.io.IOException: Server not running, 
aborting"|http://search-hadoop.com/m/Qb0BMnrTPZ1], we figured out that the 
region servers weren't able to talk to the new .META. location because the old 
one was still alive but on its way down after an OOME.

This shows up as exceptions like "Server not running" when a region server 
tries to edit .META. Digging in the code, I see that 
CT.waitForMetaServerConnectionDefault -> waitForMeta -> 
getMetaServerConnection(true) calls verifyRegionLocation since we force the 
refresh. In this method we check if the RS is good by calling getRegionInfo, 
which *does not* check if the region server is trying to close.
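
To make the gap concrete, here is a minimal sketch of the difference between a 
check that only calls getRegionInfo and one that also asks whether the server 
is going down. Everything in it is a stand-in written for illustration: the 
RegionServer interface, isStopping(), editMeta(), DyingServer, weakVerify and 
strictVerify are hypothetical names, not the actual HBase classes or the fix 
that was committed.

```java
import java.io.IOException;

/** Hypothetical sketch; none of these types are the real HBase classes. */
class VerifyLocationSketch {

    /** Stand-in for the RPC surface of a region server (assumed, not HBase's). */
    interface RegionServer {
        String getRegionInfo(String regionName) throws IOException;
        boolean isStopping();                    // hypothetical liveness probe
        void editMeta(String row) throws IOException;
    }

    /** A server that has begun aborting but still answers read-only RPCs. */
    static final class DyingServer implements RegionServer {
        public String getRegionInfo(String regionName) {
            return regionName;                   // still succeeds while going down
        }
        public boolean isStopping() {
            return true;
        }
        public void editMeta(String row) throws IOException {
            throw new IOException("Server not running, aborting");
        }
    }

    /** Mirrors the weak check described above: only getRegionInfo is called. */
    static boolean weakVerify(RegionServer rs, String region) {
        try {
            return rs.getRegionInfo(region) != null;
        } catch (IOException e) {
            return false;
        }
    }

    /** One possible stricter check: also reject a server that is stopping. */
    static boolean strictVerify(RegionServer rs, String region) {
        try {
            return !rs.isStopping() && rs.getRegionInfo(region) != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        RegionServer rs = new DyingServer();
        System.out.println("weak check passes:   " + weakVerify(rs, ".META."));
        System.out.println("strict check passes: " + strictVerify(rs, ".META."));
    }
}
```

With the weak check the stale location keeps being handed out, so every 
.META. edit goes to the dying server and fails. With the stricter check the 
location would be rejected and callers would wait for the new one.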

What this means is that a cluster can't recover from the failure of a 
.META.-serving RS until that RS has fully shut down, since every time an RS 
tries to open a region (like right after the log splitting) or split one, it 
fails editing .META.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira