Hey James,

I wrote Gremlins this spring when we were starting to test the new
durability features. I have a five node test cluster here, and I ran
Gremlins on one or two of the nodes, set up to fail a couple times an hour,
while also running some tests which insert and verify data. This found a
bunch of the bugs we fixed in HDFS as well as HBase, though there are still
some more lurking.

If you check out the code, you'll see it's really pretty simple - nothing
more than a sleep loop around some handy functions for killing things.
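
For anyone who wants the shape of that idea without reading the repository, here is a minimal sketch of a sleep loop around fault functions. It is not the actual Gremlins code: `run_fault_loop` and its parameters are hypothetical names chosen for illustration, and the faults are plain callables that you supply yourself.

```python
import random
import time


def run_fault_loop(faults, mean_interval_s, rounds, sleep=time.sleep, rng=random):
    """Sleep for a random interval, trigger one randomly chosen fault, repeat.

    Hypothetical helper for illustration only; not part of Gremlins.
    `faults` is a list of zero-argument callables that each inject one failure.
    """
    triggered = []
    for _ in range(rounds):
        # Exponentially distributed waits average out to mean_interval_s seconds,
        # so the faults arrive at irregular, hard-to-predict times.
        sleep(rng.expovariate(1.0 / mean_interval_s))
        fault = rng.choice(faults)
        fault()
        triggered.append(fault.__name__)
    return triggered
```

With `mean_interval_s=1800` this works out to roughly a couple of faults an hour. A fault callable could be as small as a function that calls `os.kill(pid, signal.SIGKILL)` on a region server process, which is the programmatic equivalent of kill -9.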

-Todd

On Tue, Aug 17, 2010 at 11:41 AM, James Kennedy <[email protected]> wrote:

> Hmm, yeah, I waited well over the ZK lease time for the HMaster to come out
> of the exception loop, but it never did.
>
> Thanks, I'll follow up with a look at Gremlins.  How is it used within
> HBase? Do you continuously run Gremlins on a cluster in an automated way?
>
> James Kennedy
> Project Manager
> Troove Inc.
>
> 1 877 330 8501
>
> On 2010-08-13, at 4:06 PM, Jean-Daniel Cryans wrote:
>
> > The master will get a Watcher event from ZooKeeper when the region
> > server's session is expired and its ephemeral znode is deleted. By
> > default the session timeout is really high, something like 1 minute
> > (see hbase-default.xml), to cope with users who have huge GC pause
> > problems.
> >
> > For a good fault testing framework, please use
> > http://github.com/toddlipcon/gremlins. This was written by Todd Lipcon
> > to test HBase's handling of region servers' death.
> >
> > J-D
> >
> > On Fri, Aug 13, 2010 at 3:59 PM, James Kennedy <[email protected]> wrote:
> >> For our system it is critical that there be no data loss and fast recovery time if any node goes down.
> >>
> >> We've recently updated the hbase-transactional-tableindexed extension to work with the latest 0.89.20100726 version of HBase (still to be pushed).
> >> All HBase tests are passing, but when we started to write our own tests for true sudden HRegionServer death we ran into trouble.
> >> It seems that the HMaster does not recognize the kill even after many minutes.  Client requests are blocked and the log keeps repeating the entries below.
> >>
> >> We realized that HBase's own tests that require RegionServer death use abort() rather than kill(), and abort() does enough cleanup that it does not adequately simulate a sudden death (e.g. a JVM crash).
> >>
> >> As an experiment I made HRegionServer.kill() public and modified HBaseMiniCluster to call that from abort() instead.  Now a test like TestMasterTransitions will exhibit similar behaviour:  The HMaster never notices the RegionServer is gone.
> >>
> >> Could it really be that sudden region server death is not handled in hbase?
> >> Or more likely is this a failure of the testing framework to adequately simulate kill -9?
> >>
> >> James Kennedy
> >> Project Manager
> >> Troove Inc.
> >>
> >>
> >> -------------------------------
> >>
> >> [13/08/10 15:12:12] 259494 [n.serverMonitor] INFO  oop.hbase.master.ServerManager  - 2 region servers, 0 dead, average load 3.5
> >> [13/08/10 15:12:12] 259560 [ger.metaScanner] INFO  adoop.hbase.master.BaseScanner  - RegionManager.metaScanner scanning meta region {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
> >> [13/08/10 15:12:12] 259561 [ger.metaScanner] WARN  adoop.hbase.master.BaseScanner  - Scan one META region: {server: 10.0.1.4:56908, regionname: .META.,,1.1028785192, startKey: <>}
> >> java.net.ConnectException: Connection refused
> >>        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> >>        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> >>        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> >>        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
> >>        at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:309)
> >>        at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:857)
> >>        at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:725)
> >>        at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:253)
> >>        at $Proxy10.openScanner(Unknown Source)
> >>        at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:182)
> >>        at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
> >>        at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
> >>        at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:156)
> >>        at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
> >>
> >>
> >>
> >>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera
