Re: Recent experience with Chaos Monkey?
To follow up, I've needed to apply these two patches to get my local
environment running:

https://issues.apache.org/jira/browse/HBASE-24360
https://issues.apache.org/jira/browse/HBASE-24361

On Tue, May 12, 2020 at 11:52 AM Nick Dimiduk wrote:
Re: Recent experience with Chaos Monkey?
Thanks Zach.

> It actually performs even worse in this case in my experience since Chaos
> Monkey can consider the failure mechanism to have failed (and eventually
> times out) because the process is too quick to recover (or the recovery
> fails because the process is already running). The only way I was able to
> get it to run was to disable the process that automatically restarts
> killed processes in my system.

Interesting observation.

> This brings up a discussion on whether the ITBLL (or whatever process)
> should even continue if either a killing or recovering action failed. I
> would argue that invalidates the entire test, but it might not be obvious
> it failed unless you were watching the logs as it went.

I'm coming to a similar conclusion -- failure in the orchestration layer
should invalidate the test.

On Thu, May 7, 2020 at 5:27 PM Zach York wrote:
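To make "failure in the orchestration layer should invalidate the test" concrete, one stopgap (short of fixing the harness itself) is to scan the run's log for the policy-layer warnings quoted in this thread and fail the run when any appear. A minimal sketch; the grep pattern is an assumption derived from the log excerpt in this thread, not an official interface:

```shell
#!/usr/bin/env bash
# Sketch: treat any chaos-action exception as a test failure, so
# orchestration problems invalidate the run instead of passing silently.
# The "WARN policies.Policy" pattern is assumed from the log excerpt
# quoted in this thread.
chaos_run_ok() {
  # Succeeds only if the log contains no chaos policy exceptions.
  ! grep -q "WARN policies.Policy: Exception occurred" "$1"
}

# Demo against the warning quoted in this thread:
log=$(mktemp)
cat > "$log" <<'EOF'
20/05/05 21:19:29 WARN policies.Policy: Exception occurred during performing action: java.io.IOException: did timeout 6ms waiting for region server to start: host-a.example.com
EOF

if chaos_run_ok "$log"; then
  echo "chaos actions clean"
else
  echo "chaos orchestration failed; invalidating the run"
fi
```

In a real run you would `tee` the ITBLL driver's output into the log file and call `chaos_run_ok` after the job exits, failing the overall run on a non-zero result.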
Re: Recent experience with Chaos Monkey?
I should note that I was using HBase 2.2.3 to test.

On Thu, May 7, 2020 at 5:26 PM Zach York wrote:
Re: Recent experience with Chaos Monkey?
I recently ran ITBLL with Chaos Monkey [1] against a real HBase
installation (EMR). I initially tried to run it locally, but couldn't get
it working and eventually gave up.

> So I'm curious if this matches others' experience running the monkey. For
> example, do you have an environment more resilient than mine, one where an
> external actor is restarting downed processes without the monkey action's
> involvement?

It actually performs even worse in this case in my experience, since Chaos
Monkey can consider the failure mechanism to have failed (and eventually
times out) because the process is too quick to recover (or the recovery
fails because the process is already running). The only way I was able to
get it to run was to disable the process that automatically restarts
killed processes in my system.

One other thing I hit was that the validation for a suspended process was
incorrect, so if Chaos Monkey tried to suspend the process, the run would
fail. I'll put up a JIRA for that.

This brings up a discussion on whether the ITBLL (or whatever process)
should even continue if either a killing or recovering action failed. I
would argue that invalidates the entire test, but it might not be obvious
it failed unless you were watching the logs as it went.

Thanks,
Zach

[1] sudo -u hbase hbase
    org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList
    -m serverKilling loop 4 2 100 ${RANDOM} 10

On Thu, May 7, 2020 at 5:05 PM Nick Dimiduk wrote:
Recent experience with Chaos Monkey?
Hello,

Does anyone have recent experience running Chaos Monkey? Are you running
against an external cluster, or one of the other modes? What monkey factory
are you using? Any property overrides? A non-default ClusterManager?

I'm trying to run ITBLL with chaos against branch-2.3 and I'm not having
much luck. My environment is an "external" cluster, 4 racks of 4 hosts
each, with the relatively simple "serverKilling" factory and
`rolling.batch.suspend.rs.ratio = 0.0`. So: randomly kill various hosts on
various schedules, plus some balancer play mixed in; no process suspension.

Running for any length of time (~30 minutes), the chaos monkey eventually
terminates between a majority and all of the hosts in the cluster. My logs
are peppered with warnings such as the one below. There are other variants.
As far as I can tell, actions are intended to cause some harm and then
restore state after themselves. In practice, the harm is successful but the
restoration rarely succeeds. Mostly these actions are "safeguarded", but
only by this 60-sec timeout. The result is a methodical termination of the
cluster.

So I'm curious if this matches others' experience running the monkey. For
example, do you have an environment more resilient than mine, one where an
external actor is restarting downed processes without the monkey action's
involvement? Is the monkey designed to run only in such an environment?
These timeouts are configurable; are you cranking them way up?

Any input you have would be greatly appreciated. This is my last major
action item blocking initial 2.3.0 release candidates.
Thanks,
Nick

20/05/05 21:19:29 WARN policies.Policy: Exception occurred during
performing action: java.io.IOException: did timeout 6ms waiting for
region server to start: host-a.example.com
    at org.apache.hadoop.hbase.HBaseCluster.waitForRegionServerToStart(HBaseCluster.java:163)
    at org.apache.hadoop.hbase.chaos.actions.Action.startRs(Action.java:228)
    at org.apache.hadoop.hbase.chaos.actions.RestartActionBaseAction.gracefulRestartRs(RestartActionBaseAction.java:70)
    at org.apache.hadoop.hbase.chaos.actions.GracefulRollingRestartRsAction.perform(GracefulRollingRestartRsAction.java:61)
    at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:50)
    at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
    at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
    at java.base/java.lang.Thread.run(Thread.java:834)
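On the "are you cranking them way up?" question: monkey property overrides are normally supplied through a properties file. A minimal sketch follows; only `rolling.batch.suspend.rs.ratio` comes from this thread, while the `-monkeyProps` flag behavior (including whether the file is resolved from the classpath or the filesystem) and any timeout key names should be verified against your branch rather than taken from here:

```shell
#!/usr/bin/env bash
# Sketch: a monkey properties file and the invocation that would consume
# it. Key names other than rolling.batch.suspend.rs.ratio are not taken
# from this thread -- look them up in the chaos monkey sources (e.g. the
# MonkeyConstants class) before relying on them.
props=/tmp/chaos-monkey.properties
cat > "$props" <<'EOF'
# No process suspension, as in the run described above.
rolling.batch.suspend.rs.ratio=0.0
# Timeout overrides would go here; verify the real key names first.
EOF

# The run itself (commented out: requires a live cluster and the hbase CLI):
#   sudo -u hbase hbase \
#     org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList \
#     -m serverKilling -monkeyProps "$props" \
#     loop 4 2 100 ${RANDOM} 10

grep '^rolling' "$props"
```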