Aditya Kishore created HBASE-10272:
--------------------------------------

             Summary: Cluster becomes in-operational if the node hosting the 
active Master AND ROOT/META table goes offline
                 Key: HBASE-10272
                 URL: https://issues.apache.org/jira/browse/HBASE-10272
             Project: HBase
          Issue Type: Bug
          Components: IPC/RPC
    Affects Versions: 0.94.15
            Reporter: Aditya Kishore
            Assignee: Aditya Kishore
            Priority: Critical


Since HBASE-6364, the HBase client caches a connection failure to a server, and 
any subsequent attempt to connect to that server throws a 
{{FailedServerException}}.
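
To make the caching behaviour concrete, here is a minimal sketch of a failed-server cache. It is not the HBase implementation: the class name {{FailedServerCache}}, its methods, and the fixed retention window are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a client-side failed-server cache; not HBase's class. */
class FailedServerCache {
    // Maps a server address to the time (in ms) at which its entry expires.
    private final Map<String, Long> expiryByAddress = new HashMap<>();
    // Assumed retention window; the real client's policy may differ.
    private final long retentionMillis;

    FailedServerCache(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    /** Records a connection failure observed at time nowMillis. */
    synchronized void markFailed(String address, long nowMillis) {
        expiryByAddress.put(address, nowMillis + retentionMillis);
    }

    /** Returns true while the address is still inside its retention window. */
    synchronized boolean isFailed(String address, long nowMillis) {
        Long expiry = expiryByAddress.get(address);
        if (expiry == null) {
            return false;
        }
        if (nowMillis >= expiry) {
            // Entry has expired: forget it so the next connect is a real attempt.
            expiryByAddress.remove(address);
            return false;
        }
        return true;
    }
}
```

A client built this way would check {{isFailed}} before opening a socket and fail fast when it returns true, which is the point at which an exception such as {{FailedServerException}} would be raised.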

Now, if the node hosting both the active Master and the ROOT/META table goes 
offline, the newly elected Master's first attempt to connect to the dead region 
server fails with {{NoRouteToHostException}}, which it handles. The second 
attempt, however, fails with {{FailedServerException}}, which is not handled, 
and the Master aborts.
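
The two-step sequence can be modelled with a small, self-contained simulation. All names here ({{FailoverSequenceDemo}}, {{FastFailException}}, {{verifyLocation}}) are hypothetical; the sketch only mirrors the order of events described above and is not the Master's code.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the failure sequence; names are not HBase's. */
class FailoverSequenceDemo {
    /** Stand-in for HBaseClient's FailedServerException. */
    static class FastFailException extends IOException {
        FastFailException(String message) {
            super(message);
        }
    }

    private final Set<String> failedServers = new HashSet<>();

    /** Simulates connecting to a host that is offline. */
    void connect(String address) throws IOException {
        if (failedServers.contains(address)) {
            // Cached failure: fail fast without touching the network.
            throw new FastFailException(
                "This server is in the failed servers list: " + address);
        }
        // First real attempt: remember the failure, then report it.
        failedServers.add(address);
        throw new NoRouteToHostException(address);
    }

    /** Mimics a caller that tolerates NoRouteToHostException only. */
    String verifyLocation(String address) {
        for (int attempt = 1; attempt <= 2; attempt++) {
            try {
                connect(address);
                return "connected";
            } catch (NoRouteToHostException e) {
                // Handled: fall through and try again.
            } catch (FastFailException e) {
                // In the report this exception is unhandled and aborts the
                // Master; here it is returned as a string so it can be observed.
                return "abort on attempt " + attempt + ": " + e.getMessage();
            } catch (IOException e) {
                return "other failure: " + e.getMessage();
            }
        }
        return "gave up";
    }
}
```

Calling {{new FailoverSequenceDemo().verifyLocation("xxx02/192.168.1.102:60020")}} takes the handled path on the first attempt and the fast-fail path on the second.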

Here is the log from one such occurrence:
{noformat}
2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: xxx02/192.168.1.102:60020
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
        at $Proxy9.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
        at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1335)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1294)
        at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1281)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:506)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:383)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:445)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnection(CatalogTracker.java:464)
        at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:624)
        at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:684)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:560)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:376)
        at java.lang.Thread.run(Thread.java:662)
2013-11-20 10:58:00,162 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
2013-11-20 10:58:00,162 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
{noformat}

Each of the backup Masters will crash with the same error, and restarting them 
has the same effect. Once this happens, the cluster remains inoperable until 
the node hosting the region server is brought back online (or until the 
ZooKeeper node containing the root region server and/or the META entry in the 
ROOT table is deleted).



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
