Re: Recent experience with Chaos Monkey?

2020-05-13 Thread Nick Dimiduk
To follow up, I've needed to apply these two patches to get my local
environment running.

https://issues.apache.org/jira/browse/HBASE-24360
https://issues.apache.org/jira/browse/HBASE-24361
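
In case it saves anyone a step, this is roughly how I'm applying them on top of
branch-2.3 before rebuilding the integration test module; the commit ids below
are placeholders, and the JIRAs above are the source of truth for the actual
patches.

  # rough sketch, not my exact commands; <sha-24360>/<sha-24361> stand in for
  # whatever ultimately lands on the two JIRAs above
  git checkout branch-2.3
  git cherry-pick <sha-24360> <sha-24361>
  mvn clean install -DskipTests -pl hbase-it -am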


Re: Recent experience with Chaos Monkey?

2020-05-12 Thread Nick Dimiduk
Thanks Zach.

> It actually performs even worse in this case in my experience since Chaos
monkey can consider the failure mechanism to have failed (and eventually
times out) because the process is too quick to recover (or the recovery
fails because the process is already running). The only way I was able to
get it to run was to disable the process that automatically restarts killed
processes in my system.

Interesting observation.

> This brings up a discussion on whether the ITBLL (or whatever process)
should even continue if either a killing or recovering action failed.
I would argue that invalidates the entire test, but it might not be obvious
it failed unless you were watching the logs as it went.

I'm coming to a similar conclusion -- failure in the orchestration layer
should invalidate the test.


Re: Recent experience with Chaos Monkey?

2020-05-07 Thread Zach York
I should note that I was using HBase 2.2.3 to test.


Re: Recent experience with Chaos Monkey?

2020-05-07 Thread Zach York
I recently ran ITBLL with Chaos monkey[1] against a real HBase installation
(EMR). I initially tried to run it locally, but couldn't get it working and
eventually gave up.

> So I'm curious if this matches others' experience running the monkey. For
example, do you have an environment more resilient than mine, one where an
external actor is restarting downed processes without the monkey action's
involvement?

It actually performs even worse in this case in my experience since Chaos
monkey can consider the failure mechanism to have failed (and eventually
times out)
because the process is too quick to recover (or the recovery fails because
the process is already running). The only way I was able to get it to run
was to disable
the process that automatically restarts killed processes in my system.

One other thing I hit was that the validation for a suspended process was
incorrect, so if chaos monkey tried to suspend the process the run would
fail. I'll put up a JIRA for that.

This brings up a discussion on whether the ITBLL (or whatever process)
should even continue if either a killing or recovering action failed. I
would argue that invalidates the entire test,
but it might not be obvious it failed unless you were watching the logs as
it went.

Thanks,
Zach


[1] sudo -u hbase hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
      -m serverKilling loop 4 2 100 ${RANDOM} 10
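
In case it helps anyone reuse that, here is the same invocation as a sketch with
my reading of the Loop tool's positional arguments spelled out; the usage string
the class prints is the source of truth, and the output path below is just a
hypothetical stand-in for my ${RANDOM} hack.

  # sketch only: argument names reflect my reading of Loop's usage string
  ITERATIONS=4                        # generate+verify cycles to run
  MAPPERS=2                           # generator map tasks per cycle
  NODES_PER_MAPPER=100                # linked-list nodes written by each mapper
  OUTPUT_DIR="/tmp/itbll-${RANDOM}"   # hypothetical path; I simply passed ${RANDOM}
  REDUCERS=10                         # reduce tasks for the verify step

  sudo -u hbase hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
    -m serverKilling \
    loop "$ITERATIONS" "$MAPPERS" "$NODES_PER_MAPPER" "$OUTPUT_DIR" "$REDUCERS"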


Recent experience with Chaos Monkey?

2020-05-07 Thread Nick Dimiduk
Hello,

Does anyone have recent experience running Chaos Monkey? Are you running
against an external cluster, or one of the other modes? What monkey factory
are you using? Any property overrides? A non-default ClusterManager?

I'm trying to run ITBLL with chaos against branch-2.3 and I'm not having
much luck. My environment is an "external" cluster, 4 racks of 4 hosts
each, the relatively simple "serverKilling" factory with
`rolling.batch.suspend.rs.ratio = 0.0`. So, randomly kill various hosts on
various schedules, plus some balancer play mixed in; no process suspension.
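
For concreteness, the override goes in through a small chaos monkey properties
file passed with -monkeyProps (flag name from memory of IntegrationTestBase, so
worth double-checking); everything else is left at the factory's defaults.

  # monkey.properties -- the single override I'm applying
  rolling.batch.suspend.rs.ratio=0.0

The run itself is then the usual ITBLL loop invocation with
`-m serverKilling -monkeyProps monkey.properties` added.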

Running for any length of time (~30 minutes), the chaos monkey eventually
terminates somewhere between a majority and all of the hosts in the cluster. My logs
are peppered with warnings such as the below. There are other variants. As
far as I can tell, actions are intended to cause some harm and then restore
state after themselves. In practice, the harm is successful but restoration
rarely succeeds. Mostly these actions are "safeguarded", but only by this
60-second timeout. The result is a methodical termination of the cluster.

So I'm curious if this matches others' experience running the monkey. For
example, do you have an environment more resilient than mine, one where an
external actor is restarting downed processes without the monkey action's
involvement? Is the monkey designed to run only in such an environment?
These timeouts are configurable; are you cranking them way up?
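
(The knob I mean is the per-action kill/start timeout that the chaos actions
read from the Configuration. Going from memory, the region server start key is
hbase.chaosmonkey.action.startrstimeout with a 60000 ms default, so cranking it
up would look roughly like the sketch below -- please verify the key against
Action.java on your branch before relying on it.)

  # sketch: raise the RS start timeout from the 60s default to 5 minutes;
  # the property key is assumed from memory of Action.java, and the loop
  # arguments below are arbitrary placeholders
  hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
    -Dhbase.chaosmonkey.action.startrstimeout=300000 \
    -m serverKilling \
    loop 1 4 100000 /tmp/ITBLL 4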

Any input you have would be greatly appreciated. This is my last major
action item blocking initial 2.3.0 release candidates.

Thanks,
Nick

20/05/05 21:19:29 WARN policies.Policy: Exception occurred during
performing action: java.io.IOException: did timeout 60000ms waiting for
region server to start: host-a.example.com
        at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
        at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
        at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
        at org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
        at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
        at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
        at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
        at java.base/java.lang.Thread.run(Thread.java:834)