Hmm, yeah, I waited well over the ZK lease time for the HMaster to come out of the exception loop, but it never did.
Thanks, I'll follow up with a look at Gremlins. How is it used within HBase? Do you continuously run gremlins on a cluster in an automated way?

James Kennedy
Project Manager
Troove Inc.
1 877 330 8501

On 2010-08-13, at 4:06 PM, Jean-Daniel Cryans wrote:

> The master will get a Watcher event from ZooKeeper when the region
> server's session is expired and its ephemeral znode is deleted. By
> default the session timeout is really high to cope with users with
> huge GC pauses problems, something like 1 minute (see
> hbase-default.xml).
>
> For a good fault testing framework, please use
> http://github.com/toddlipcon/gremlins. This was written by Todd Lipcon
> to test HBase's handling of region servers' death.
>
> J-D
>
> On Fri, Aug 13, 2010 at 3:59 PM, James Kennedy <[email protected]>
> wrote:
>> For our system it is critical that there be no data loss and fast recovery
>> time if any node goes down.
>>
>> We've recently updated the hbase-transactional-tableindexed extension to
>> work with the latest 0.89.20100726 version of HBase (still to be pushed).
>> All HBase tests are passing but then when we started to write our own and
>> test true sudden HRegionServer death we ran into trouble.
>> It seems that the HMaster does not recognize the kill even after many
>> minutes. Client requests are blocked and the log continues to repeat the
>> logs below.
>>
>> We realized that HBase's own tests that require RegionServer death use
>> abort() and not kill() which does enough cleanup to inadequately simulate a
>> sudden (e.g. JVM crash) death.
>>
>> As an experiment I made HRegionServer.kill() public and modified
>> HBaseMiniCluster to call that from abort() instead. Now a test like
>> TestMasterTransitions will exhibit similar behaviour: The HMaster never
>> notices the RegionServer is gone.
>>
>> Could it really be that sudden region server death is not handled in hbase?
>> Or more likely is this a failure of the testing framework to adequately
>> simulate kill -9?
>>
>> James Kennedy
>> Project Manager
>> Troove Inc.
>>
>>
>> -------------------------------
>>
>> [13/08/10 15:12:12] 259494 [n.serverMonitor] INFO oop.hbase.master.ServerManager - 2 region servers, 0 dead, average load 3.5
>> [13/08/10 15:12:12] 259560 [ger.metaScanner] INFO adoop.hbase.master.BaseScanner - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
>> [13/08/10 15:12:12] 259561 [ger.metaScanner] WARN adoop.hbase.master.BaseScanner - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
>> java.net.ConnectException: Connection refused
>>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
>>     at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
>>     at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
>>     at $Proxy10.openScanner(Unknown Source)
>>     at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
>>     at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>     at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>     at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
>>     at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
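
For readers following the thread, the detection mechanism J-D describes (a killed region server is only noticed once its ZooKeeper session goes silent for longer than the session timeout) can be illustrated with a toy model. This is a minimal sketch, not HBase or ZooKeeper code: the class name SessionExpirySketch, its methods, and the server names are hypothetical, and the 60000 ms value only mirrors the "something like 1 minute" default mentioned above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of session-based failure detection; hypothetical, not HBase or ZooKeeper code. */
class SessionExpirySketch {
    private final long sessionTimeoutMs;
    private final Map<String, Long> lastHeartbeatMs = new HashMap<>();

    SessionExpirySketch(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Record a heartbeat from a live server at the given time. */
    void heartbeat(String server, long nowMs) {
        lastHeartbeatMs.put(server, nowMs);
    }

    /** Remove and return servers that have been silent for longer than the timeout. */
    List<String> expire(long nowMs) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastHeartbeatMs.entrySet()) {
            if (nowMs - e.getValue() > sessionTimeoutMs) {
                dead.add(e.getKey());
            }
        }
        for (String server : dead) {
            lastHeartbeatMs.remove(server);
        }
        return dead;
    }

    public static void main(String[] args) {
        // Assumed timeout of 60 s, matching the "about 1 minute" figure in the thread.
        SessionExpirySketch tracker = new SessionExpirySketch(60_000);
        tracker.heartbeat("rs1", 0);
        tracker.heartbeat("rs2", 0);
        // rs1 is killed abruptly after t=0; only rs2 keeps sending heartbeats.
        tracker.heartbeat("rs2", 30_000);
        System.out.println(tracker.expire(30_000)); // prints []
        System.out.println(tracker.expire(60_001)); // prints [rs1]
    }
}
```

In this model a sudden kill stays invisible for the whole timeout window, so a test that simulates kill -9 has to wait at least that long (or run with a shorter timeout) before expecting the master to react.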
