The master will get a Watcher event from ZooKeeper when the region server's session expires and its ephemeral znode is deleted. By default the session timeout is quite high, something like 1 minute (see hbase-default.xml), to cope with users who have long GC pauses, so a killed region server won't be noticed before that timeout elapses.
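
If you want the master to notice a dead region server sooner in your tests, the knob should be zookeeper.session.timeout, overridden in hbase-site.xml. A minimal sketch, assuming a 30 second timeout is tolerable for your GC behaviour (the value is only illustrative, not a tested recommendation):

```xml
<!-- hbase-site.xml: overrides the default from hbase-default.xml -->
<property>
  <name>zookeeper.session.timeout</name>
  <!-- milliseconds; 30000 is an illustrative value chosen for this example -->
  <value>30000</value>
</property>
```

The trade-off is that a region server stuck in a GC pause longer than this will have its session expired and be treated as dead. Also keep in mind that the ZooKeeper server bounds the negotiated session timeout based on its tickTime, so the effective value may differ from what you ask for.
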
For a good fault testing framework, please use http://github.com/toddlipcon/gremlins. This was written by Todd Lipcon to test HBase's handling of region servers' death.

J-D

On Fri, Aug 13, 2010 at 3:59 PM, James Kennedy <[email protected]> wrote:
> For our system it is critical that there be no data loss and fast recovery
> time if any node goes down.
>
> We've recently updated the hbase-transactional-tableindexed extension to work
> with the latest 0.89.20100726 version of HBase (still to be pushed).
> All HBase tests are passing but then when we started to write our own and
> test true sudden HRegionServer death we ran into trouble.
> It seems that the HMaster does not recognize the kill even after many
> minutes. Client requests are blocked and the log continues to repeat the
> logs below.
>
> We realized that HBase's own tests that require RegionServer death use
> abort() and not kill() which does enough cleanup to inadequately simulate a
> sudden (e.g. JVM crash) death.
>
> As an experiment I made HRegionServer.kill() public and modified
> HBaseMiniCluster to call that from abort() instead. Now a test like
> TestMasterTransitions will exhibit similar behaviour: The HMaster never
> notices the RegionServer is gone.
>
> Could it really be that sudden region server death is not handled in hbase?
> Or more likely is this a failure of the testing framework to adequately
> simulate kill -9?
>
> James Kennedy
> Project Manager
> Troove Inc.
>
> -------------------------------
>
> [13/08/10 15:12:12] 259494 [n.serverMonitor] INFO oop.hbase.master.ServerManager - 2 region servers, 0 dead, average load 3.5
> [13/08/10 15:12:12] 259560 [ger.metaScanner] INFO adoop.hbase.master.BaseScanner - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
> [13/08/10 15:12:12] 259561 [ger.metaScanner] WARN adoop.hbase.master.BaseScanner - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
> java.net.ConnectException: Connection refused
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
>     at $Proxy10.openScanner(Unknown Source)
>     at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
>     at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>     at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>     at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
>     at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
