I should note that I was using HBase 2.2.3 to test.
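For anyone trying to reproduce: the per-action timeouts can be overridden through a monkey properties file passed with -monkeyProps. The property names below are my best guess from the timeout constants in the chaos Action classes on branch-2.3, so double-check them against the version you're testing:

```properties
# monkey.properties -- overrides for the serverKilling monkey
# (timeout property names are assumptions; verify against Action.java
#  in your HBase version before relying on them)

# disable process suspension, as in Nick's setup
rolling.batch.suspend.rs.ratio=0.0

# raise the 60s start/kill safeguards (values in ms) so that slow
# restores don't fail the action and leave hosts down
hbase.chaosmonkey.action.startrstimeout=300000
hbase.chaosmonkey.action.killrstimeout=300000
```

Passed along the lines of: hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList -m serverKilling -monkeyProps monkey.properties ...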

On Thu, May 7, 2020 at 5:26 PM Zach York <[email protected]>
wrote:

> I recently ran ITBLL with Chaos monkey[1] against a real HBase
> installation (EMR). I initially tried to run it locally, but couldn't get
> it working and eventually gave up.
>
> > So I'm curious if this matches others' experience running the monkey. For
> > example, do you have an environment more resilient than mine, one where an
> > external actor is restarting downed processes without the monkey action's
> > involvement?
>
> In my experience it actually performs even worse in that case, since Chaos
> Monkey can consider the kill action to have failed (and eventually times
> out) because the process recovers too quickly, or the recovery fails
> because the process is already running. The only way I was able to get it
> to run was to disable the process that automatically restarts killed
> processes in my system.
>
> One other thing I hit: the validation for a suspended process was
> incorrect, so if Chaos Monkey tried to suspend a process the run would
> fail. I'll put up a JIRA for that.
>
> This brings up a discussion on whether ITBLL (or whatever test is running)
> should even continue if either a kill or a recovery action fails. I would
> argue that a failed action invalidates the entire test, but the failure
> might not be obvious unless you were watching the logs as it went.
>
> Thanks,
> Zach
>
>
> [1] sudo -u hbase hbase
> org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList -m serverKilling
> loop 4 2 1000000 ${RANDOM} 10
>
> On Thu, May 7, 2020 at 5:05 PM Nick Dimiduk <[email protected]> wrote:
>
>> Hello,
>>
>> Does anyone have recent experience running Chaos Monkey? Are you running
>> against an external cluster, or one of the other modes? What monkey
>> factory
>> are you using? Any property overrides? A non-default ClusterManager?
>>
>> I'm trying to run ITBLL with chaos against branch-2.3 and I'm not having
>> much luck. My environment is an "external" cluster, 4 racks of 4 hosts
>> each, with the relatively simple "serverKilling" factory and
>> `rolling.batch.suspend.rs.ratio = 0.0`. So: randomly kill various hosts
>> on various schedules, plus some balancer play mixed in; no process
>> suspension.
>>
>> Running for any length of time (~30 minutes), the chaos monkey eventually
>> terminates anywhere from a majority to all of the hosts in the cluster.
>> My logs are peppered with warnings such as the one below; there are other
>> variants. As far as I can tell, actions are intended to cause some harm
>> and then restore state after themselves. In practice, the harm succeeds
>> but the restoration rarely does. Mostly these actions are "safeguarded"
>> by this 60-second timeout. The result is a methodical termination of the
>> cluster.
>>
>> So I'm curious if this matches others' experience running the monkey. For
>> example, do you have an environment more resilient than mine, one where an
>> external actor is restarting downed processes without the monkey action's
>> involvement? Is the monkey designed to run only in such an environment?
>> These timeouts are configurable; are you cranking them way up?
>>
>> Any input you have would be greatly appreciated. This is my last major
>> action item blocking initial 2.3.0 release candidates.
>>
>> Thanks,
>> Nick
>>
>> 20/05/05 21:19:29 WARN policies.Policy: Exception occurred during
>> performing action: java.io.IOException: did timeout 60000ms waiting for
>> region server to start: host-a.example.com
>>         at
>>
>> org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
>>         at
>> org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
>>         at
>>
>> org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
>>         at
>>
>> org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
>>         at
>>
>> org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
>>         at
>>
>> org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
>>         at
>>
>> org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
>>         at java.base/java.lang.Thread.run(Thread.java:834)
>>
>