Hmm, yeah, I waited well over the ZK lease time for the HMaster to come out of
the exception loop, but it never did.

Thanks, I'll follow up with a look at Gremlins.  How is it used within HBase?
Do you run Gremlins continuously on a cluster in an automated way?
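
For reference, the detection path described in the quoted reply below (session
expires, ephemeral znode is deleted, the master's watcher fires) can be modeled
roughly as follows. This is only a toy in-memory sketch in plain Java, not the
ZooKeeper client API and not HBase code: ToyCoordinator, heartbeat(),
watchDeletions() and expireSessions() are hypothetical names made up for
illustration, and the 60-second timeout is just the "something like 1 minute"
figure mentioned in the reply.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Toy model of session expiry. Hypothetical; NOT the ZooKeeper API. */
class ToyCoordinator {
    private final long sessionTimeoutMs;
    // server name -> time of last heartbeat; an entry stands in for the ephemeral znode
    private final Map<String, Long> lastHeartbeat = new TreeMap<>();
    private final List<Consumer<String>> deletionWatchers = new ArrayList<>();

    ToyCoordinator(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /** Registers a callback that is invoked once per expired server (the "master" side). */
    void watchDeletions(Consumer<String> watcher) {
        deletionWatchers.add(watcher);
    }

    /** Records that the server's session was alive at nowMs. */
    void heartbeat(String server, long nowMs) {
        lastHeartbeat.put(server, nowMs);
    }

    boolean isRegistered(String server) {
        return lastHeartbeat.containsKey(server);
    }

    /** Drops every session silent for longer than the timeout, then notifies watchers. */
    List<String> expireSessions(long nowMs) {
        List<String> expired = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            String server = entry.getKey();
            if (nowMs - entry.getValue() > sessionTimeoutMs) {
                it.remove(); // models deletion of the ephemeral znode
                expired.add(server);
            }
        }
        for (String server : expired) {
            for (Consumer<String> watcher : deletionWatchers) {
                watcher.accept(server);
            }
        }
        return expired;
    }
}

class SessionExpiryDemo {
    public static void main(String[] args) {
        ToyCoordinator zk = new ToyCoordinator(60_000); // assumed ~1 minute timeout
        List<String> deadSeenByMaster = new ArrayList<>();
        zk.watchDeletions(deadSeenByMaster::add);

        zk.heartbeat("rs1", 0);
        zk.heartbeat("rs2", 0);
        // rs1 dies abruptly right after t=0; only rs2 keeps sending heartbeats.
        zk.heartbeat("rs2", 30_000);
        zk.expireSessions(30_000); // rs1 has been silent for 30s: still within the timeout
        zk.heartbeat("rs2", 60_000);
        zk.expireSessions(60_001); // rs1 has been silent for more than 60s: expired
        System.out.println("dead: " + deadSeenByMaster);
    }
}
```

In this model a server that dies without any cleanup is only noticed once the
timeout has elapsed, which is why I waited longer than the lease time before
concluding that the master was stuck.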

James Kennedy
Project Manager
Troove Inc.

1 877 330 8501

On 2010-08-13, at 4:06 PM, Jean-Daniel Cryans wrote:

> The master will get a Watcher event from ZooKeeper when the region
> server's session expires and its ephemeral znode is deleted. By
> default the session timeout is really high, something like 1 minute
> (see hbase-default.xml), to cope with users who have huge GC pause
> problems.
> 
> For a good fault testing framework, please use
> http://github.com/toddlipcon/gremlins. This was written by Todd Lipcon
> to test HBase's handling of region servers' death.
> 
> J-D
> 
> On Fri, Aug 13, 2010 at 3:59 PM, James Kennedy <[email protected]> 
> wrote:
>> For our system it is critical that there be no data loss and fast recovery 
>> time if any node goes down.
>> 
>> We've recently updated the hbase-transactional-tableindexed extension to 
>> work with the latest 0.89.20100726 version of HBase (still to be pushed).
>> All HBase tests are passing, but when we started to write our own tests of 
>> true sudden HRegionServer death we ran into trouble.
>> It seems that the HMaster does not recognize the kill even after many 
>> minutes.  Client requests are blocked and the log keeps repeating the 
>> entries below.
>> 
>> We realized that HBase's own tests that require RegionServer death use 
>> abort() rather than kill(). abort() does enough cleanup that it does not 
>> adequately simulate a sudden (e.g. JVM crash) death.
>> 
>> As an experiment I made HRegionServer.kill() public and modified 
>> HBaseMiniCluster to call that from abort() instead.  Now a test like 
>> TestMasterTransitions will exhibit similar behaviour:  The HMaster never 
>> notices the RegionServer is gone.
>> 
>> Could it really be that sudden region server death is not handled in HBase?
>> Or, more likely, is this a failure of the testing framework to adequately 
>> simulate kill -9?
>> 
>> James Kennedy
>> Project Manager
>> Troove Inc.
>> 
>> 
>> -------------------------------
>> 
>> [13/08/10 15:12:12] 259494 [n.serverMonitor] INFO  oop.hbase.master.ServerManager  - 2 region servers, 0 dead, average load 3.5
>> [13/08/10 15:12:12] 259560 [ger.metaScanner] INFO  adoop.hbase.master.BaseScanner  - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
>> [13/08/10 15:12:12] 259561 [ger.metaScanner] WARN  adoop.hbase.master.BaseScanner  - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
>> java.net.ConnectException: Connection refused
>>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
>>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
>>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
>>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
>>        at $Proxy10.openScanner(Unknown Source)
>>        at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
>>        at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
>>        at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
>>        at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
>>        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
>> 
>> 
>> 
>> 
