Thanks, Zach.

> It actually performs even worse in this case in my experience since Chaos
> monkey can consider the failure mechanism to have failed (and eventually
> times out) because the process is too quick to recover (or the recovery
> fails because the process is already running). The only way I was able to
> get it to run was to disable the process that automatically restarts killed
> processes in my system.

Interesting observation.
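
For what it's worth, the knobs for this sort of thing can be fed to the
monkey as a properties file. A rough sketch of what I mean, assuming the
-monkeyProps option on the integration test runner (the suspend ratio is the
one override I actually use; I have not verified the exact key names for the
kill/restart timeouts, so check Action.java / MonkeyConstants before relying
on anything else):

  # monkey.properties -- overrides for the serverKilling factory
  # disable process suspension entirely
  rolling.batch.suspend.rs.ratio = 0.0
  # raising the kill/restart timeouts would go here too; I haven't
  # confirmed the exact key names, so look them up in the source

  sudo -u hbase hbase \
    org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
    -m serverKilling -monkeyProps monkey.properties \
    loop 4 2 1000000 ${RANDOM} 10

Cranking those timeouts way up might at least stop the safeguard from
declaring defeat while an external restarter races the monkey's own recovery
step.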

> This brings up a discussion on whether the ITBLL (or whatever process)
> should even continue if either a killing or recovering action failed.
> I would argue that invalidates the entire test, but it might not be obvious
> it failed unless you were watching the logs as it went.

I'm coming to a similar conclusion -- failure in the orchestration layer
should invalidate the test.
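
Concretely, I imagine something like the sketch below: the policy loop gives
up on the whole run the moment a kill or recovery step fails, instead of
logging a WARN and moving on. This is a plain-Java illustration only, not the
actual chaos framework API; ChaosAction, OrchestrationFailedException, and
FailFastPolicy are made-up names.

  import java.util.List;

  // Hypothetical stand-in for a chaos action that kills or recovers a process.
  interface ChaosAction {
    void perform() throws Exception;
  }

  // Signals that the orchestration layer itself failed, so the test run
  // can no longer be trusted.
  class OrchestrationFailedException extends RuntimeException {
    OrchestrationFailedException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Runs actions in order and aborts on the first orchestration failure.
  class FailFastPolicy implements Runnable {
    private final List<ChaosAction> actions;

    FailFastPolicy(List<ChaosAction> actions) {
      this.actions = actions;
    }

    @Override
    public void run() {
      for (ChaosAction action : actions) {
        try {
          action.perform();
        } catch (Exception e) {
          // Today the real policies log a warning and keep going; the point
          // here is to surface the failure so the run is marked invalid.
          throw new OrchestrationFailedException(
              "Chaos action failed; invalidating the run", e);
        }
      }
    }
  }

However it surfaces in practice, the important bit is that the operator
shouldn't have to be tailing the logs to learn that the run stopped being
meaningful halfway through.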

On Thu, May 7, 2020 at 5:27 PM Zach York <zyork.contribut...@gmail.com>
wrote:

> I should note that I was using HBase 2.2.3 to test.
>
> On Thu, May 7, 2020 at 5:26 PM Zach York <zyork.contribut...@gmail.com>
> wrote:
>
> > I recently ran ITBLL with Chaos monkey[1] against a real HBase
> > installation (EMR). I initially tried to run it locally, but couldn't get
> > it working and eventually gave up.
> >
> > > So I'm curious if this matches others' experience running the monkey.
> > > For example, do you have an environment more resilient than mine, one
> > > where an external actor is restarting downed processes without the
> > > monkey action's involvement?
> >
> > It actually performs even worse in this case in my experience since Chaos
> > monkey can consider the failure mechanism to have failed (and eventually
> > times out) because the process is too quick to recover (or the recovery
> > fails because the process is already running). The only way I was able to
> > get it to run was to disable the process that automatically restarts
> > killed processes in my system.
> >
> > One other thing I hit was that the validation for a suspended process was
> > incorrect, so if chaos monkey tried to suspend the process the run would
> > fail. I'll put up a JIRA for that.
> >
> > This brings up a discussion on whether the ITBLL (or whatever process)
> > should even continue if either a killing or recovering action failed. I
> > would argue that invalidates the entire test, but it might not be obvious
> > it failed unless you were watching the logs as it went.
> >
> > Thanks,
> > Zach
> >
> >
> > [1] sudo -u hbase hbase
> > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList -m serverKilling
> > loop 4 2 1000000 ${RANDOM} 10
> >
> > On Thu, May 7, 2020 at 5:05 PM Nick Dimiduk <ndimi...@apache.org> wrote:
> >
> >> Hello,
> >>
> >> Does anyone have recent experience running Chaos Monkey? Are you running
> >> against an external cluster, or one of the other modes? What monkey
> >> factory are you using? Any property overrides? A non-default
> >> ClusterManager?
> >>
> >> I'm trying to run ITBLL with chaos against branch-2.3 and I'm not having
> >> much luck. My environment is an "external" cluster, 4 racks of 4 hosts
> >> each, the relatively simple "serverKilling" factory with
> >> `rolling.batch.suspend.rs.ratio = 0.0`. So, randomly kill various hosts
> >> on various schedules, plus some balancer play mixed in; no process
> >> suspension.
> >>
> >> Running for any length of time (~30 minutes), the chaos monkey eventually
> >> terminates between a majority and all of the hosts in the cluster. My
> >> logs are peppered with warnings such as the below. There are other
> >> variants. As far as I can tell, actions are intended to cause some harm
> >> and then restore state after themselves. In practice, the harm is
> >> successful but restoration rarely succeeds. Mostly these actions are
> >> "safeguarded" by this 60-sec timeout. The result is a methodical
> >> termination of the cluster.
> >>
> >> So I'm curious if this matches others' experience running the monkey. For
> >> example, do you have an environment more resilient than mine, one where
> >> an external actor is restarting downed processes without the monkey
> >> action's involvement? Is the monkey designed to run only in such an
> >> environment? These timeouts are configurable; are you cranking them way
> >> up?
> >>
> >> Any input you have would be greatly appreciated. This is my last major
> >> action item blocking initial 2.3.0 release candidates.
> >>
> >> Thanks,
> >> Nick
> >>
> >> 20/05/05 21:19:29 WARN policies.Policy: Exception occurred during
> >> performing action: java.io.IOException: did timeout 60000ms waiting for
> >> region server to start: host-a.example.com
> >>         at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
> >>         at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
> >>         at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
> >>         at org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
> >>         at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
> >>         at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
> >>         at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
> >>         at java.base/java.lang.Thread.run(Thread.java:834)
> >>
> >
>
