Hello,

Does anyone have recent experience running Chaos Monkey? Are you running against an external cluster, or one of the other modes? What monkey factory are you using? Any property overrides? A non-default ClusterManager?
I'm trying to run ITBLL with chaos against branch-2.3 and I'm not having much luck. My environment is an "external" cluster, 4 racks of 4 hosts each, with the relatively simple "serverKilling" factory and `rolling.batch.suspend.rs.ratio = 0.0`. So: randomly kill various hosts on various schedules, plus some balancer play mixed in; no process suspension.

After running for any length of time (~30 minutes), the chaos monkey eventually terminates somewhere between a majority and all of the hosts in the cluster. My logs are peppered with warnings such as the one below. There are other variants. As far as I can tell, actions are intended to cause some harm and then restore state after themselves. In practice, the harm succeeds but the restoration rarely does. Most of these actions are "safeguarded" by this 60-sec timeout. The result is a methodical termination of the cluster.

So I'm curious whether this matches others' experience running the monkey. For example, do you have an environment more resilient than mine, one where an external actor restarts downed processes without the monkey action's involvement? Is the monkey designed to run only in such an environment? These timeouts are configurable; are you cranking them way up?

Any input you have would be greatly appreciated. This is my last major action item blocking initial 2.3.0 release candidates.
Thanks,
Nick

20/05/05 21:19:29 WARN policies.Policy: Exception occurred during performing action: java.io.IOException: did timeout 60000ms waiting for region server to start: host-a.example.com
    at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
    at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
    at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
    at org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
    at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
    at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
    at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
    at java.base/java.lang.Thread.run(Thread.java:834)