Ah, and one more thing: when developing the chaos-injection mechanism
in the mgo/txn package, I also added a "chance" parameter for each of
killing and slowing down a given breakpoint. It sounds like it
would be useful for juju's mechanism too. If you kill every time, it's
hard to tell whether the system would know how to retry properly.
Killing or slowing down just sometimes, or perhaps the first 2 times
out of every 3, for example, would enable the system to recover
itself, and an external agent to ensure it continues to work properly.
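
To make the "chance" idea concrete, here is a minimal Go sketch. It is
not the mgo/txn code and not a proposal for juju's API: the names
Injector, KillChance, SlowChance, Delay and FirstNOfEvery are all made
up for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// ErrKilled is returned when the injector decides to fail a breakpoint.
var ErrKilled = errors.New("chaos: killed at breakpoint")

// Injector is a hypothetical type, not the mgo/txn or juju API.
// It is not safe for concurrent use because *rand.Rand is not.
type Injector struct {
	KillChance float64       // probability in [0, 1] of failing
	SlowChance float64       // probability in [0, 1] of sleeping
	Delay      time.Duration // how long to sleep when slowing down
	rng        *rand.Rand
}

// NewInjector takes a seed so that a failing run can be reproduced.
func NewInjector(seed int64, killChance, slowChance float64, delay time.Duration) *Injector {
	return &Injector{
		KillChance: killChance,
		SlowChance: slowChance,
		Delay:      delay,
		rng:        rand.New(rand.NewSource(seed)),
	}
}

// Breakpoint fails with probability KillChance; otherwise it sleeps for
// Delay with probability SlowChance and then lets the caller proceed.
func (in *Injector) Breakpoint() error {
	if in.rng.Float64() < in.KillChance {
		return ErrKilled
	}
	if in.rng.Float64() < in.SlowChance {
		time.Sleep(in.Delay)
	}
	return nil
}

// FirstNOfEvery is the deterministic variant: it fails the first N calls
// in every cycle of M calls (N=2, M=3 gives fail, fail, pass, ...).
// M must be greater than zero.
type FirstNOfEvery struct {
	N, M  int
	calls int
}

// ShouldFail reports whether the current call falls in the failing part
// of the cycle, then advances the call counter.
func (f *FirstNOfEvery) ShouldFail() bool {
	fail := f.calls%f.M < f.N
	f.calls++
	return fail
}

func main() {
	// Retry an operation that is killed the first 2 times out of every 3.
	f := &FirstNOfEvery{N: 2, M: 3}
	attempts := 0
	for {
		attempts++
		if !f.ShouldFail() {
			break
		}
	}
	fmt.Println("succeeded after attempts:", attempts) // prints 3

	// Random variant: roughly half of the breakpoints are killed.
	in := NewInjector(42, 0.5, 0.25, time.Millisecond)
	killed := 0
	for i := 0; i < 100; i++ {
		if in.Breakpoint() != nil {
			killed++
		}
	}
	fmt.Println("killed:", killed, "of 100")
}
```

With a chance below 1, a retry loop gets the opportunity to succeed,
which is what lets an external agent check that the system recovers.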

On Wed, Aug 13, 2014 at 11:25 AM, Gustavo Niemeyer
<gustavo.nieme...@canonical.com> wrote:
> That's a nice direction, Menno.
>
> The main thing that comes to mind is that it sounds quite inconvenient
> to turn the feature on. It may sound otherwise because it's so easy to
> drop files at arbitrary places in our local machines, but when dealing
> with a distributed system that knows how to spawn its own resources
> up, suddenly the "just write a file" becomes surprisingly boring and
> race prone.
>
> What about:
>
>     juju inject-failure [--unit=unit] [--service=service] <failure name>
>     juju deploy [--inject-failure=name] ...
>
>
>
> On Wed, Aug 13, 2014 at 7:17 AM, Menno Smits <menno.sm...@canonical.com> wrote:
>> There's been some discussion recently about adding some feature to Juju to
>> allow developers or CI tests to intentionally trigger otherwise hard to
>> induce failures in specific parts of Juju. The idea is that sometimes we
>> need some kind of failure to happen in a CI test or when manually testing
>> but those failures can often be hard to make happen.
>>
>> For example, for changes to Juju's upgrade mechanics that I'm working on at the
>> moment I would like to ensure that an upgrade is cleanly aborted if one of
>> the state servers in an HA environment refuses to start the upgrade. This
>> logic is well unit tested but there's nothing like seeing it actually work
>> in a real environment to build confidence - however, it isn't easy to make a
>> state server misbehave in this way.
>>
>> To help with this kind of testing scenario, I've created a new top-level
>> package called "wrench" which lets us "drop a wrench in the works" so to
>> speak. It's very simple with one main API which can be called from
>> judiciously chosen points in Juju's execution to decide whether some failure
>> should be triggered.
>>
>> The module looks for files in $jujudatadir/wrench (typically
>> /var/lib/juju/wrench) on the local machine. If I wanted to trigger the
>> upgrade failure described above I could drop a file in that directory on one
>> of the state servers named say "machine-agent" with the content:
>>
>> refuse-upgrade
>>
>> Then in some part of jujud's upgrade code there could be a check like:
>>
>> if wrench.IsActive("machine-agent", "refuse-upgrade") {
>>      // trigger the failure
>> }
>>
>> The idea is this check would be left in the code to aid CI tests and future
>> manual tests.
>>
>> You can see the incomplete wrench package here:
>> https://github.com/juju/juju/pull/508
>>
>> There are a few issues to nut out.
>>
>> 1. It needs to be difficult/impossible for someone to accidentally or
>> maliciously activate this feature, especially in production environments. I
>> have almost finished (but not pushed to Github) some changes to the wrench
>> package which make it strict about the ownership and permissions on the
>> wrench files. This should make it harder for the wrong person to drop files
>> into the wrench directory.
>>
>> The idea has also been floated to only enable this functionality in
>> non-stable builds. This certainly gives a good level of protection but I'm
>> slightly wary of this approach because it makes it impossible for CI to take
>> advantage of the wrench feature when testing stable release builds. I'm
>> happy to be convinced that the benefit is worth the cost.
>>
>> Other ideas on how to better handle this are very welcome.
>>
>> 2. The wrench functionality needs to be disabled during unit test runs
>> because we don't want any wrench files a developer may have lying around to
>> affect Juju's behaviour during test runs. The wrench package has a global
>> on/off switch so I plan on switching it off in BaseSuite's setup or similar.
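
A global switch of this kind could be sketched in Go as below. The names
SetEnabled and lookup are hypothetical, and lookup is only a stand-in
for the real file check; returning the previous state is a design
assumption that lets a test suite restore it on teardown.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// enabled is the global switch: 1 means on, 0 means off.
var enabled int32 = 1

// lookup stands in for the real wrench file check (hypothetical hook).
var lookup = func(category, feature string) bool { return true }

// SetEnabled turns the feature on or off and returns the previous state.
func SetEnabled(on bool) bool {
	var v int32
	if on {
		v = 1
	}
	return atomic.SwapInt32(&enabled, v) == 1
}

// IsActive always reports false while the switch is off, so stray wrench
// files cannot influence a test run.
func IsActive(category, feature string) bool {
	if atomic.LoadInt32(&enabled) == 0 {
		return false
	}
	return lookup(category, feature)
}

func main() {
	// What a suite's setup and teardown might do.
	prev := SetEnabled(false)
	fmt.Println(IsActive("machine-agent", "refuse-upgrade")) // prints "false"
	SetEnabled(prev)
	fmt.Println(IsActive("machine-agent", "refuse-upgrade")) // prints "true"
}
```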
>>
>> 3. The name is a bikeshedding magnet :)  Other names that have been bandied
>> about for this feature are "chaos" and "spanner". I don't care too much so
>> if there's a strong consensus for another name let's use that. I chose
>> "wrench" over "spanner" because I believe that's the more common usage in
>> the US and because Spanner is a DB from Google. Let's not get carried
>> away...
>>
>> All comments, ideas and concerns welcome.
>>
>> - Menno
>>
>>
>>
>> --
>> Juju-dev mailing list
>> Juju-dev@lists.ubuntu.com
>> Modify settings or unsubscribe at:
>> https://lists.ubuntu.com/mailman/listinfo/juju-dev
>>
>
> --
> gustavo @ http://niemeyer.net



-- 
gustavo @ http://niemeyer.net