Hello,

Does anyone have recent experience running Chaos Monkey? Are you running
against an external cluster, or one of the other modes? What monkey factory
are you using? Any property overrides? A non-default ClusterManager?

I'm trying to run ITBLL with chaos against branch-2.3 and I'm not having
much luck. My environment is an "external" cluster, 4 racks of 4 hosts
each, the relatively simple "serverKilling" factory with
`rolling.batch.suspend.rs.ratio = 0.0`. So, randomly kill various hosts on
various schedules, plus some balancer play mixed in; no process suspension.

When I run for any length of time (~30 minutes), the chaos monkey eventually
terminates somewhere between a majority and all of the hosts in the cluster. My logs
are peppered with warnings such as the below. There are other variants. As
far as I can tell, actions are intended to cause some harm and then restore
state after themselves. In practice, the harm is successful but restoration
rarely succeeds. Mostly these actions are "safeguarded" by this 60-sec
timeout. The result is a methodical termination of the cluster.
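
To make the failure mode concrete, here is a minimal sketch of the
kill-then-restore pattern as I understand it. This is not the HBase code:
the class, the method names, and the `isRunning` probe are all hypothetical,
and the timeout is shortened so the example finishes quickly. It only
illustrates how a start that times out leaves the host down, because the
caller logs a warning and moves on.

```java
import java.io.IOException;
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not HBase code: an action kills a process, requests a
// restart, then polls until the process is up or a timeout elapses.
public class RestartSketch {

  // Poll the (hypothetical) isRunning probe until it reports true,
  // or throw once timeoutMs has elapsed.
  public static void waitForStart(String host, BooleanSupplier isRunning,
      long timeoutMs, long pollMs) throws IOException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!isRunning.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new IOException("did timeout " + timeoutMs
            + "ms waiting for region server to start: " + host);
      }
      try {
        Thread.sleep(pollMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new IOException("interrupted waiting for " + host, e);
      }
    }
  }

  // Returns true if the host came back, false if it was left down.
  public static boolean killThenRestore(String host, BooleanSupplier isRunning,
      long timeoutMs) {
    // The "harm" step (kill) and the restart request would happen here.
    try {
      waitForStart(host, isRunning, timeoutMs, 10);
      return true;
    } catch (IOException e) {
      // The caller only logs a warning; nothing retries the start,
      // so the host stays down for the rest of the run.
      System.out.println("WARN " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    // A process that never comes back within the timeout.
    boolean restored = killThenRestore("host-a.example.com", () -> false, 50);
    System.out.println("restored=" + restored);
  }
}
```

If every action behaves like this, each timed-out restore permanently
removes one more host, which would match what I am seeing.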

So I'm curious if this matches others' experience running the monkey. For
example, do you have an environment more resilient than mine, one where an
external actor is restarting downed processes without the monkey action's
involvement? Is the monkey designed to run only in such an environment?
These timeouts are configurable; are you cranking them way up?

Any input you have would be greatly appreciated. This is my last major
action item blocking initial 2.3.0 release candidates.

Thanks,
Nick

20/05/05 21:19:29 WARN policies.Policy: Exception occurred during performing action: java.io.IOException: did timeout 60000ms waiting for region server to start: host-a.example.com
        at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
        at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
        at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
        at org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
        at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
        at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
        at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
        at java.base/java.lang.Thread.run(Thread.java:834)
