TL;DR: we would like to change the way HA is tested upstream to avoid
being hitten by evitable bugs that the CI process should discover.

Long version:

Today HA testing in upstream consist only in verifying that a three
controllers setup comes up correctly and can spawn an instance. That's
something, but it’s far from being enough since we continuously see "day
two" bugs.
We started covering this more than a year ago in internal CI and today
also on rdocloud using a project named tripleo-quickstart-utils [1].
Apart from his name, the project is not limited to tripleo-quickstart,
it covers three principal roles:

1 - stonith-config: a playbook that can be used to automate the creation
of fencing devices in the overcloud;
2 - instance-ha: a playbook that automates the seventeen manual steps
needed to configure instance HA in the overcloud, test them via rally
and verify that instance HA works;
3 - validate-ha: a playbook that runs a series of disruptive actions in
the overcloud and verifies it always behaves correctly by deploying a
heat-template that involves all the overcloud components;

To make this usable upstream, we need to understand where to put this
code. Here some choices:

1 - tripleo-validations: the most logical place to put this, at least
looking at the name, would be tripleo-validations. I've talked with some
of the folks working on it, and it came out that the meaning of
tripleo-validations project is not doing disruptive tests. Integrating
this stuff would be out of scope.

2 - tripleo-quickstart-extras: apart from the fact that this is not
something meant just for quickstart (the project supports infrared and
"plain" environments as well) even if we initially started there, in the
end, it came out that nobody was looking at the patches since nobody was
able to verify them. The result was a series of reviews stuck forever.
So moving back to extras would be a step backward.

3 - Dedicated project (tripleo-ha-utils or just tripleo-utils): like for
tripleo-upgrades or tripleo-validations it would be perfect having all
this grouped and usable as a standalone thing. Any integration is
possible inside the playbook for whatever kind of test. Today we're
using the bash framework to interact with the cluster, rally to test
instance-ha and Ansible itself to simulate full power outage scenarios.

There's been a lot of talk about this during the last PTG [2], and
unfortunately, I'll not be part of the next one, but I would like to see
things moving on this side.
Everything I wrote is of course up to discussion, that's precisely the
meaning of this mail.

Thanks to all who'll give advice, suggestions, and thoughts about all
this stuff.


Raoul Scarazzini

OpenStack Development Mailing List (not for usage questions)

Reply via email to