[ https://issues.apache.org/jira/browse/HBASE-10272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aditya Kishore updated HBASE-10272:
-----------------------------------
Attachment: HBASE-10272_0.94.patch
Patch for 0.94 branch.
> Cluster becomes non-operational if the node hosting the active Master AND
> ROOT/META table goes offline
> -----------------------------------------------------------------------------------------------------
>
> Key: HBASE-10272
> URL: https://issues.apache.org/jira/browse/HBASE-10272
> Project: HBase
> Issue Type: Bug
> Components: IPC/RPC
> Affects Versions: 0.94.15
> Reporter: Aditya Kishore
> Assignee: Aditya Kishore
> Priority: Critical
> Attachments: HBASE-10272_0.94.patch
>
>
> Since HBASE-6364, the HBase client caches a connection failure to a server,
> and any subsequent attempt to connect to that server throws a
> {{FailedServerException}}.
> Now, if a node that hosted the active Master AND the ROOT/META table goes
> offline, the newly anointed Master's initial attempt to connect to the dead
> region server fails with {{NoRouteToHostException}}, which it handles, but
> the second attempt fails with {{FailedServerException}}, which is not
> handled, so the Master aborts.
> Here is the log from one such occurrence:
> {noformat}
> 2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> 2013-11-20 10:58:00,161 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
> org.apache.hadoop.hbase.ipc.HBaseClient$FailedServerException: This server is in the failed servers list: xxx02/192.168.1.102:60020
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:425)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1124)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:974)
>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:86)
>     at $Proxy9.getProtocolVersion(Unknown Source)
>     at org.apache.hadoop.hbase.ipc.WritableRpcEngine.getProxy(WritableRpcEngine.java:138)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC.waitForProxy(HBaseRPC.java:208)
>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1335)
>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1294)
>     at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getHRegionConnection(HConnectionManager.java:1281)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.getCachedConnection(CatalogTracker.java:506)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.getMetaServerConnection(CatalogTracker.java:383)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMeta(CatalogTracker.java:445)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.waitForMetaServerConnection(CatalogTracker.java:464)
>     at org.apache.hadoop.hbase.catalog.CatalogTracker.verifyMetaRegionLocation(CatalogTracker.java:624)
>     at org.apache.hadoop.hbase.master.HMaster.assignRootAndMeta(HMaster.java:684)
>     at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:560)
>     at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:376)
>     at java.lang.Thread.run(Thread.java:662)
> 2013-11-20 10:58:00,162 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
> 2013-11-20 10:58:00,162 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
> {noformat}
> Each of the backup masters will crash with the same error, and restarting
> them has the same effect. Once this happens, the cluster remains
> non-operational until the node with the region server is brought back online
> (or the ZooKeeper node containing the root region server and/or the META
> entry in the ROOT table is deleted).
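
For readers unfamiliar with the failure-caching behaviour described in the issue, the following is a minimal, self-contained sketch of the general idea. It is not the HBase implementation and not the attached patch. The class names (FailedServerSketch, FailedServers), the helper methods, the use of explicit timestamps, and the 2000 ms retention window are all assumptions chosen for illustration. It shows how a first connection attempt to an unreachable host can fail with NoRouteToHostException while a second attempt made shortly afterwards fails fast with a different exception type, which a caller that only handles the first type will not catch.

```java
import java.io.IOException;
import java.net.NoRouteToHostException;
import java.util.HashMap;
import java.util.Map;

public class FailedServerSketch {

    /** Hypothetical stand-in for the fast-fail exception named in the report. */
    static class FailedServerException extends IOException {
        FailedServerException(String message) {
            super(message);
        }
    }

    /** Remembers addresses that recently failed, each until an expiry time. */
    static class FailedServers {
        private final Map<String, Long> expiryByAddress = new HashMap<>();
        private final long retentionMillis; // assumed retention window

        FailedServers(long retentionMillis) {
            this.retentionMillis = retentionMillis;
        }

        synchronized void add(String address, long nowMillis) {
            expiryByAddress.put(address, nowMillis + retentionMillis);
        }

        synchronized boolean isFailed(String address, long nowMillis) {
            Long expiry = expiryByAddress.get(address);
            if (expiry == null) {
                return false;
            }
            if (expiry <= nowMillis) {
                expiryByAddress.remove(address); // entry has aged out
                return false;
            }
            return true;
        }
    }

    /** Simulates connecting to a host that is always unreachable. */
    static void connect(FailedServers failed, String address, long nowMillis)
            throws IOException {
        if (failed.isFailed(address, nowMillis)) {
            // Fast-fail path: no network attempt is made at all.
            throw new FailedServerException(
                "This server is in the failed servers list: " + address);
        }
        // A real client would open a socket here; this sketch always fails.
        failed.add(address, nowMillis);
        throw new NoRouteToHostException("No route to host: " + address);
    }

    /** Returns the simple class name of the exception raised by one attempt. */
    static String attempt(FailedServers failed, String address, long nowMillis) {
        try {
            connect(failed, address, nowMillis);
            return "connected";
        } catch (IOException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        FailedServers failed = new FailedServers(2000);
        String address = "xxx02/192.168.1.102:60020";
        System.out.println(attempt(failed, address, 0));    // first attempt
        System.out.println(attempt(failed, address, 100));  // within the window
        System.out.println(attempt(failed, address, 5000)); // after the window
    }
}
```

In this sketch a caller that catches only NoRouteToHostException would survive the first attempt and be surprised by the second, which mirrors the sequence reported above. Whether the right fix is to handle the fast-fail exception in the caller or to retry after the retention window is a design question for the actual patch.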