For our system it is critical that there be no data loss and that recovery be 
fast if any node goes down.

We've recently updated the hbase-transactional-tableindexed extension to work 
with the latest 0.89.20100726 version of HBase (still to be pushed).
All HBase tests pass, but when we started writing our own tests for true 
sudden HRegionServer death, we ran into trouble.
It seems that the HMaster does not recognize the kill, even after many minutes.
Client requests are blocked and the log keeps repeating the entries below.

We realized that HBase's own tests that require RegionServer death use abort() 
rather than kill(). abort() does enough cleanup that it does not adequately 
simulate a sudden death (e.g. a JVM crash).
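
To make the abort()/kill() distinction concrete, here is a toy model. It is 
not HBase code: ToyMaster, ToyRegionServer, the heartbeat scheme and the 
3000 ms timeout are all made up for illustration. It only assumes that a 
graceful shutdown tells the master directly, while a crash can be noticed 
only through some liveness timeout.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical master: tracks the last heartbeat per server on a caller-supplied clock. */
final class ToyMaster {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();
    private final Set<String> dead = new TreeSet<>();

    ToyMaster(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void heartbeat(String server, long now) {
        if (!dead.contains(server)) {
            lastHeartbeat.put(server, now);
        }
    }

    /** Graceful path: the server itself reports that it is going away. */
    void reportShutdown(String server) {
        lastHeartbeat.remove(server);
        dead.add(server);
    }

    /** Crash path: silence is the only signal, so expire stale heartbeats. */
    void expireSilentServers(long now) {
        lastHeartbeat.entrySet().removeIf(e -> {
            if (now - e.getValue() > timeoutMillis) {
                dead.add(e.getKey());
                return true;
            }
            return false;
        });
    }

    boolean isDead(String server) {
        return dead.contains(server);
    }
}

/** Hypothetical region server with a graceful abort() and an abrupt kill(). */
final class ToyRegionServer {
    private final String name;
    private final ToyMaster master;
    private boolean running = true;

    ToyRegionServer(String name, ToyMaster master) {
        this.name = name;
        this.master = master;
    }

    void tick(long now) {
        if (running) {
            master.heartbeat(name, now);
        }
    }

    /** Stops and cleans up: the master learns about the death immediately. */
    void abort() {
        running = false;
        master.reportShutdown(name);
    }

    /** Stops with no cleanup at all, like a JVM crash or kill -9. */
    void kill() {
        running = false;
    }
}

final class SuddenDeathDemo {
    public static void main(String[] args) {
        ToyMaster master = new ToyMaster(3000);
        ToyRegionServer a = new ToyRegionServer("rs-a", master);
        ToyRegionServer b = new ToyRegionServer("rs-b", master);
        a.tick(0);
        b.tick(0);
        a.abort();
        b.kill();
        master.expireSilentServers(1000);
        System.out.println(master.isDead("rs-a")); // true: reported by abort()
        System.out.println(master.isDead("rs-b")); // false: timeout not reached
        master.expireSilentServers(4000);
        System.out.println(master.isDead("rs-b")); // true: heartbeat expired
    }
}
```

In this model a test that only ever calls abort() never exercises the 
expiry path, which is the path a real crash depends on.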

As an experiment I made HRegionServer.kill() public and modified 
HBaseMiniCluster so that its abort path calls kill() instead.  Now a test like 
TestMasterTransitions exhibits similar behaviour: the HMaster never notices 
that the RegionServer is gone.

Could it really be that sudden region server death is not handled in HBase?
Or, more likely, is this a failure of the testing framework to adequately 
simulate kill -9?

James Kennedy
Project Manager
Troove Inc.


-------------------------------

[13/08/10 15:12:12] 259494 [n.serverMonitor] INFO  oop.hbase.master.ServerManager  - 2 region servers, 0 dead, average load 3.5
[13/08/10 15:12:12] 259560 [ger.metaScanner] INFO  adoop.hbase.master.BaseScanner  - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
[13/08/10 15:12:12] 259561 [ger.metaScanner] WARN  adoop.hbase.master.BaseScanner  - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
        at $Proxy10.openScanner(Unknown Source)
        at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
        at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
        at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
        at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)


