Sending an update before the weekend: the gate was in very bad shape again today (long queue, lots of failures), and it turns out we had a few more issues, which we tracked here: https://etherpad.openstack.org/p/tripleo-gate-issues-june-2018
## scenario007 broke because of a patch in networking-ovn

https://bugs.launchpad.net/tripleo/+bug/1777168

We made the job non-voting and in the meantime tried and managed to fix it:
https://review.rdoproject.org/r/#/c/14155/
The breaking commit was:
https://github.com/openstack/networking-ovn/commit/2365df1cc3e24deb2f3745c925d78d6d8e5bb5df
Kudos to Daniel Alvarez for having the patch ready! Also thanks to Wes for
making the job non-voting in the meantime. I've reverted the non-voting
change as the situation is fixed now, so we can vote again on this one.

## Dockerhub proxy issue

Infra was using the wrong object storage proxy for Dockerhub image layers:
https://review.openstack.org/#/c/575787/
Huge thanks to the infra team, especially Clark, for fixing this super
quickly; it clearly helped stabilize our container jobs. I actually haven't
seen timeouts since we merged your patch. Thanks a ton!

## RDO master wasn't consistent anymore, python-cloudkittyclient broke

The client was refactored:
https://git.openstack.org/cgit/openstack/python-cloudkittyclient/commit/?id=d070f6a68cddf51c57e77107f1b823a8f75770ba
That broke the RPM, and we had to completely rewrite the dependencies so we
could build the package again: https://review.rdoproject.org/r/#/c/14265/
A thousand thanks, Heikel, for your responsive help at 3am, so we could
become consistent again and get our latest RPMs, which contained a bunch of
fixes.

## Where we are now

The gate looks stable now. You can recheck and approve things. I went ahead
and rechecked everything and made sure nothing was left abandoned.
Steve's work has merged, so I think we could reconsider
https://review.openstack.org/#/c/575330/ again.

Special thanks to everyone involved in these issues, and to Alex & John, who
also stepped up to help.
Enjoy your weekend!

On Thu, Jun 14, 2018 at 6:40 AM, Emilien Macchi <emil...@redhat.com> wrote:

> It sounds like we merged a bunch last night thanks to the revert, so I
> went ahead and restored/rechecked everything that was out of the gate.
> I've checked and nothing was left over, but let me know in case I missed
> something.
> I'll keep updating this thread with the progress made to improve the
> situation.
> So from now on, the situation is back to "normal"; recheck/+W is OK.
>
> Thanks again for your patience.
>
> On Wed, Jun 13, 2018 at 10:39 PM, Emilien Macchi <emil...@redhat.com>
> wrote:
>
>> https://review.openstack.org/575264 just landed (and didn't time out in
>> check or gate without a recheck, so a good sign it helped to mitigate).
>>
>> I've restored and rechecked some patches that I had evacuated from the
>> gate; please do not restore others or recheck or approve anything for
>> now, and let's see how it goes with a few patches.
>> We're still working with Steve on his patches to optimize the way we
>> deploy containers on the registry, and we are investigating how we could
>> make it faster with a proxy.
>>
>> Stay tuned, and thanks for your patience.
>>
>> On Wed, Jun 13, 2018 at 5:50 PM, Emilien Macchi <emil...@redhat.com>
>> wrote:
>>
>>> TL;DR: the gate queue was 25h+; we put all patches from the gate on
>>> standby, do not restore/recheck until further announcement.
>>>
>>> We recently enabled the containerized undercloud for multinode jobs,
>>> and we believe this was a bit premature, as the container download
>>> process is not yet optimized to avoid pulling the same containers from
>>> the mirrors multiple times.
>>> This caused the job runtime to increase, and probably increased the
>>> load on the docker.io mirrors hosted by OpenStack Infra, making them
>>> slower at serving the same containers multiple times. The time taken to
>>> prepare containers on the undercloud and then for the overcloud caused
>>> the jobs to randomly time out, and therefore the gate to fail a large
>>> number of times, so we decided to remove all jobs from the gate by
>>> abandoning the patches temporarily (I have them in my browser and will
>>> restore them when things are stable again; please do not touch
>>> anything).
>>>
>>> Steve Baker has been working on a series of patches that optimize the
>>> way we prepare the containers. Basically, the workflow will be (a rough
>>> sketch of this flow is appended at the bottom of this mail):
>>> - pull the containers needed for the undercloud into a local registry,
>>> using the infra mirror if available
>>> - deploy the containerized undercloud
>>> - pull the containers needed for the overcloud, minus the ones already
>>> pulled for the undercloud, using the infra mirror if available
>>> - update the containers on the overcloud
>>> - deploy the overcloud
>>>
>>> With that process, we hope to reduce the runtime of the deployment and
>>> therefore reduce the timeouts in the gate.
>>> To enable it, we need to land, in this order:
>>> https://review.openstack.org/#/c/571613/,
>>> https://review.openstack.org/#/c/574485/,
>>> https://review.openstack.org/#/c/571631/ and
>>> https://review.openstack.org/#/c/568403.
>>>
>>> In the meantime, as a mitigation we are disabling the containerized
>>> undercloud that was recently enabled on all scenarios
>>> (https://review.openstack.org/#/c/575264/), with the hope of
>>> stabilizing things until Steve's patches land.
>>> Hopefully, we can merge Steve's work tonight/tomorrow and re-enable the
>>> containerized undercloud on the scenarios after checking that we don't
>>> have timeouts and that deployment runtimes are reasonable.
>>>
>>> That's the plan we came up with; if you have any questions or feedback,
>>> please share.
>>> --
>>> Emilien, Steve and Wes
>>>
>>
>>
>>
>> --
>> Emilien Macchi
>>
>
>
>
> --
> Emilien Macchi
>

--
Emilien Macchi
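P.S. In case it helps to visualize the pull-once-then-reuse flow quoted above, here is a minimal sketch in Python driving the docker CLI. It is purely illustrative: the mirror host, the local registry address (192.168.24.1:8787), and the image names are placeholders I am assuming for the example, not what Steve's patches actually implement.

```python
#!/usr/bin/env python3
"""Minimal sketch of the container preparation flow (illustrative only)."""
import subprocess

# All of these endpoints and image names are assumptions for the sketch.
INFRA_MIRROR = "mirror.regionone.example.org:8082"  # hypothetical infra mirror
LOCAL_REGISTRY = "192.168.24.1:8787"                # assumed undercloud-local registry

UNDERCLOUD_IMAGES = {
    "tripleomaster/centos-binary-keystone:current-tripleo",      # placeholder
}
OVERCLOUD_IMAGES = {
    "tripleomaster/centos-binary-keystone:current-tripleo",      # shared with undercloud
    "tripleomaster/centos-binary-nova-compute:current-tripleo",  # placeholder
}

def mirror_to_local_registry(images, already_pulled=frozenset()):
    """Pull each image once through the infra mirror, retag it for the
    local registry, and push it there, skipping anything already pulled."""
    for image in sorted(set(images) - set(already_pulled)):
        src = f"{INFRA_MIRROR}/{image}"
        dst = f"{LOCAL_REGISTRY}/{image}"
        subprocess.run(["docker", "pull", src], check=True)
        subprocess.run(["docker", "tag", src, dst], check=True)
        subprocess.run(["docker", "push", dst], check=True)

# 1. pull the undercloud containers once, via the mirror
mirror_to_local_registry(UNDERCLOUD_IMAGES)
# 2. deploy the containerized undercloud (real command elided in this sketch)
# 3. pull only the overcloud images not already in the local registry
mirror_to_local_registry(OVERCLOUD_IMAGES, already_pulled=UNDERCLOUD_IMAGES)
# 4./5. update the containers on the overcloud and deploy it
```

The point of the sketch is simply that each image crosses the external network once; the undercloud and overcloud deployments then both read from the local registry instead of hitting the mirrors again.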