Hi Raoul and all, Sorry for joining this discussion late!
Raoul Scarazzini <ra...@redhat.com> wrote:
TL;DR: we would like to change the way HA is tested upstream to avoid being hit by avoidable bugs that the CI process should discover.

Long version: Today HA testing upstream consists only of verifying that a three-controller setup comes up correctly and can spawn an instance. That's something, but it's far from being enough, since we continuously see "day two" bugs. We started covering this more than a year ago in internal CI, and today also on rdocloud, using a project named tripleo-quickstart-utils. Despite its name, the project is not limited to tripleo-quickstart; it covers three principal roles:

1 - stonith-config: a playbook that can be used to automate the creation of fencing devices in the overcloud;
2 - instance-ha: a playbook that automates the seventeen manual steps needed to configure instance HA in the overcloud, tests them via Rally, and verifies that instance HA works;
3 - validate-ha: a playbook that runs a series of disruptive actions in the overcloud and verifies that it always behaves correctly by deploying a Heat template that involves all the overcloud components.
Yes, a more rigorous approach to HA testing obviously has huge value, not just for TripleO deployments, but also for any type of OpenStack deployment.
To make this usable upstream, we need to understand where to put this code. Here are some choices:
[snipped]

I do not work on TripleO, but I'm part of the wider OpenStack sub-communities which focus on HA and, more recently, self-healing. With that hat on, I'd like to suggest that maybe it's possible to collaborate on this in a manner which is agnostic to the deployment mechanism. There is an open spec on this:

    https://review.openstack.org/#/c/443504/

which was mentioned in the Denver PTG session on destructive testing which you referenced.

As mentioned in the self-healing SIG's session in Dublin, the OPNFV community has already put a lot of effort into testing HA scenarios, and it would be great if this work was shared across the whole OpenStack community. In particular they have a project called Yardstick:

    https://www.opnfv.org/community/projects/yardstick

which contains a bunch of HA test cases:

    http://docs.opnfv.org/en/latest/submodules/yardstick/docs/testing/user/userguide/15-list-of-tcs.html#h-a

Currently each sub-community and vendor seems to be reinventing HA testing by itself to some extent, which is easier to accomplish in the short term, but obviously less efficient in the long term. It would be awesome if we could break these silos down and join efforts! :-)

Cheers,
Adam

#openstack-ha on Freenode IRC
https://wiki.openstack.org/wiki/Self-healing_SIG
https://etherpad.openstack.org/p/qa-queens-ptg-destructive-testing
https://etherpad.openstack.org/p/self-healing-ptg-rocky

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev