On 03/07/2016 12:00 PM, Derek Higgins wrote:
> On 7 March 2016 at 12:11, John Trowbridge <[email protected]> wrote:
>>
>>
>> On 03/06/2016 11:58 AM, James Slagle wrote:
>>> On Sat, Mar 5, 2016 at 11:15 AM, Emilien Macchi <[email protected]> wrote:
>>>> I'm kind of hijacking Dan's e-mail, but I would like to propose some
>>>> technical improvements to stop having so many CI failures.
>>>>
>>>>
>>>> 1/ Stop creating swap files. We don't have SSDs, so it is IMHO a
>>>> terrible mistake to swap on files when we don't have enough RAM. In
>>>> my experience, swapping on non-SSD disks is even worse than not
>>>> having enough RAM. We should stop doing that, I think.
>>>
>>> We have been relying on swap in tripleo-ci for a little while. While
>>> not ideal, it has been an effective way to at least be able to test
>>> what we've been testing, given the amount of physical RAM that is
>>> available.
>>>
>>> The recent change to add swap to the overcloud nodes has proved to be
>>> unstable. But that has more to do with it being racy with the
>>> validation deployment, afaict. There are some patches currently up to
>>> address those issues.
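
Chiming in on the swap point in 1/: for anyone who hasn't looked at it,
"creating swap files" on a node boils down to roughly the sketch below.
This is a minimal, hypothetical illustration, not the actual tripleo-ci
code; the path, sizes and RAM threshold are made up, and the guard at the
top is just one way to avoid paying the non-SSD swap penalty on nodes that
already have enough memory.

    #!/usr/bin/env python
    """Hypothetical sketch: add a swap file only when the node is short on
    RAM. Not the tripleo-ci implementation; values below are made up.
    Needs root on a Linux host."""
    import subprocess

    SWAPFILE = "/swapfile"     # assumed location
    MIN_RAM_MB = 8192          # assumed "enough RAM" threshold
    SWAP_SIZE_MB = 4096        # assumed swap size


    def total_ram_mb():
        """Read MemTotal (kB) from /proc/meminfo and convert to MB."""
        with open("/proc/meminfo") as meminfo:
            for line in meminfo:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) // 1024
        raise RuntimeError("MemTotal not found in /proc/meminfo")


    def ensure_swap():
        if total_ram_mb() >= MIN_RAM_MB:
            return  # enough RAM: skip swap entirely
        # Standard swap-file recipe: allocate, restrict perms, format, enable.
        subprocess.check_call(["dd", "if=/dev/zero", "of=" + SWAPFILE,
                               "bs=1M", "count=%d" % SWAP_SIZE_MB])
        subprocess.check_call(["chmod", "600", SWAPFILE])
        subprocess.check_call(["mkswap", SWAPFILE])
        subprocess.check_call(["swapon", SWAPFILE])


    if __name__ == "__main__":
        ensure_swap()
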
>>>
>>>>
>>>>
>>>> 2/ Split CI jobs into scenarios.
>>>>
>>>> Currently we have CI jobs for ceph, HA, non-HA and containers, and
>>>> the current situation is that jobs fail randomly due to performance
>>>> issues.
>>>>
>>>> Puppet OpenStack CI had the same issue: we had one integration job,
>>>> and we never stopped adding more services until everything became
>>>> *very* unstable. We solved that issue by splitting the jobs and
>>>> creating scenarios:
>>>>
>>>> https://github.com/openstack/puppet-openstack-integration#description
>>>>
>>>> What I propose is to split the TripleO jobs into more jobs, but with
>>>> fewer services each.
>>>>
>>>> The benefits of that:
>>>>
>>>> * more service coverage
>>>> * jobs will run faster
>>>> * fewer random issues due to bad performance
>>>>
>>>> The cost, of course, is that it will consume more resources.
>>>> That's why I suggest 3/.
>>>>
>>>> We could have:
>>>>
>>>> * HA job with ceph and a full compute scenario (glance, nova, cinder,
>>>> ceilometer, aodh & gnocchi).
>>>> * Same with IPv6 & SSL.
>>>> * HA job without ceph and a full compute scenario too.
>>>> * HA job without ceph and basic compute (glance and nova), with extra
>>>> services like Trove, Sahara, etc.
>>>> * ...
>>>> (note: all jobs would have network isolation, which is to me a
>>>> requirement when testing an installer like TripleO).
>>>
>>> Each of those jobs would require at least as much memory as our
>>> current HA job. I don't see how this gets us to using less memory. The
>>> HA job we have now already deploys the minimal set of services that is
>>> possible given our current architecture. Without the composable
>>> service roles work, we can't deploy fewer services than we already do.
>>>
>>>
>>>
>>>>
>>>> 3/ Drop the non-HA job.
>>>> I'm not sure why we have it, or what the benefit of testing it is
>>>> compared to HA.
>>>
>>> In my opinion, I actually think that we could drop the ceph and non-HA
>>> jobs from the check-tripleo queue.
>>>
>>> non-HA doesn't test anything realistic, and it doesn't really provide
>>> any faster feedback on patches. At most it might run 15-20 minutes
>>> faster than the HA job on average. Sometimes it even runs slower than
>>> the HA job.
>>>
>>> The ceph job we could move to the experimental queue to run on demand
>>> on patches that might affect ceph, and it could also be a daily
>>> periodic job.
>>>
>>> The same could be done for the containers job, an IPv6 job, and an
>>> upgrades job. Ideally with a way to run an individual job as needed.
>>> Would we need different experimental queues to do that?
>>>
>>> That would leave only the HA job in the check queue, which we should
>>> run with SSL and network isolation. We could deploy fewer testenvs,
>>> since we'd have fewer jobs running, but give the ones we do deploy
>>> more RAM. I think this would really alleviate a lot of the transient,
>>> intermittent failures we get in CI currently. It would also likely run
>>> faster.
>>>
>>> It's probably worth seeking out some exact evidence from the RDO
>>> centos-ci, because I think they are testing with virtual environments
>>> that have a lot more RAM than tripleo-ci does. It'd be good to
>>> understand whether they have some of the transient failures that
>>> tripleo-ci does as well.
>>>
>>
>> The HA job in RDO CI is also more unstable than non-HA, although this
>> usually has nothing to do with memory contention. Most of the time that
>> I see the HA job fail spuriously in RDO CI, it is because of the Nova
>> scheduler race. I would bet that this race is the cause of the
>> fluctuating amount of time jobs take as well, because the recovery
>> mechanism for it is just to retry. Those retries can add 15 min. per
>> retry to the deploy. In RDO CI there is a 60 min. timeout for the
>> deploy as well. If we can't deploy to virtual machines in under an
>> hour, to me that is a bug. (Note, I am speaking of `openstack overcloud
>> deploy` when I say deploy, though start to finish can take less than an
>> hour with decent CPUs.)
>>
>> RDO CI uses the following layout:
>> Undercloud: 12G RAM, 4 CPUs
>> 3x Control Nodes: 4G RAM, 1 CPU
>> Compute Node: 4G RAM, 1 CPU
>
> We're currently using 4G overcloud nodes also; if we ever bump this,
> you'll probably have to as well.
>
>>
>> Is there any ability in our current CI setup to auto-identify the cause
>> of a failure? The nova scheduler race has some telltale log snippets we
>> could search for, and we could even auto-recheck jobs that hit known
>
> We attempted this in the past; iirc we had some rules in elastic-recheck
> to catch some of the error patterns we were seeing at the time, but
> eventually that work stalled here:
> https://review.openstack.org/#/c/98154/
> Somebody at the time (I don't remember who) then agreed to do some of
> the dashboard changes needed, but they mustn't have gotten the time to
> do it. Maybe we could revisit it; who knows, things might have changed
> enough since then that the concerns raised no longer apply.
>
>> issues. That, combined with some record of how often we hit these known
>> issues, would be really helpful.
>
> We can currently use logstash to find specific error patterns for errors
> that make their way to the console log, so for a subset of bugs we can
> see how often we hit them. This could also be improved by stashing more
> into logstash.
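
Right, and as a strawman for the "auto-identify the cause of a failure"
idea above, something along these lines could tag a failed run straight
from its console log. It's only a rough sketch on my part: the signature
patterns, labels and the idea of feeding the result into an auto-recheck
are illustrative assumptions, not an existing elastic-recheck rule set.

    #!/usr/bin/env python
    """Rough sketch: classify a failed CI run by scanning its console log
    for known failure signatures. elastic-recheck does this properly via
    logstash; the patterns and labels here are only illustrative."""
    import re
    import sys
    from urllib.request import urlopen

    # signature regex -> label (examples only, not a curated rule set)
    KNOWN_FAILURES = [
        (re.compile(r"No valid host was found"),
         "nova scheduler race - candidate for auto-recheck"),
        (re.compile(r"Stack overcloud CREATE_FAILED"),
         "overcloud deploy failed"),
        (re.compile(r"Timed out waiting for a reply", re.IGNORECASE),
         "messaging timeout"),
    ]


    def classify(console_url):
        """Return the labels of every known signature found in the log."""
        log = urlopen(console_url).read().decode("utf-8", "replace")
        hits = [label for pattern, label in KNOWN_FAILURES
                if pattern.search(log)]
        return hits or ["unclassified failure - needs a human"]


    if __name__ == "__main__":
        for label in classify(sys.argv[1]):
            print(label)
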
I've been collecting queries here:
https://etherpad.openstack.org/p/tripleo-ci-logstash-queries

>
>>
>>> We really are deploying on the absolute minimum cpu/ram requirements
>>> that is even possible. I think it's unrealistic to expect a lot of
>>> stability in that scenario, and I think that's a big reason why we get
>>> so many transient failures.
>>>
>>> In summary: give the testenvs more RAM, have one job in the
>>> check-tripleo queue, as many jobs as needed in the experimental queue,
>>> and as many periodic jobs as necessary.
>>>
>>
>> +1 I like this idea.
>>>
>>>>
>>>>
>>>> Any comment / feedback is welcome,
>>>> --
>>>> Emilien Macchi
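
To put some numbers behind the queries in that etherpad, they could in
principle be fed to the logstash backend's count API, roughly as below.
This is only a sketch: the endpoint URL, index pattern and example query
string are placeholders I'm assuming for illustration, not the details of
the actual infra deployment.

    #!/usr/bin/env python
    """Sketch: count recent hits for a logstash-style query (e.g. one from
    the etherpad above). Endpoint URL and index pattern are assumed
    placeholders, not the real infra values."""
    import json
    from urllib.request import Request, urlopen

    ES_URL = "http://logstash.example.org:9200"   # assumed endpoint
    INDEX = "logstash-*"                          # assumed index pattern


    def count_hits(query_string, hours=24):
        """Run an Elasticsearch _count query limited to the last N hours."""
        body = {
            "query": {
                "bool": {
                    "must": {"query_string": {"query": query_string}},
                    "filter": {
                        "range": {"@timestamp": {"gte": "now-%dh" % hours}}
                    },
                }
            }
        }
        req = Request("%s/%s/_count" % (ES_URL, INDEX),
                      data=json.dumps(body).encode("utf-8"),
                      headers={"Content-Type": "application/json"})
        return json.loads(urlopen(req).read().decode("utf-8"))["count"]


    if __name__ == "__main__":
        # Example query string (illustrative only).
        query = 'message:"No valid host was found"'
        print("%d hits in the last 24h for: %s" % (count_hits(query), query))
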

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev