Hello,

Previously we reported that Libvirt crashes, OOMs, and Tempest SSH banner failures were a problem. The SSH banner failures have since been sorted out; thanks to everyone who helped work that out. For details, see https://review.openstack.org/#/c/439638/. A race that was causing SSH to fail in test_attach_detach_volume has also been fixed in https://review.openstack.org/#/c/449661/ (this change has not merged yet, so it would be great if Tempest cores could get it in).
To address the OOMs, there has also been work to reduce the memory overhead of running devstack. Changes to reduce Apache's memory use have gone in:

https://review.openstack.org/#/c/446741/
https://review.openstack.org/#/c/445910/

We also tried putting MySQL on a diet, but that had to be reverted: https://review.openstack.org/#/c/446196/.

There is also a new memory_tracker logging service (https://review.openstack.org/#/c/434470/), whose output you'll now find in your job logs. This can be useful for determining where memory was used, which in turn helps target reductions.

It is great to see people take an interest in addressing memory issues, and according to elastic-recheck the OOM killer is no longer a major problem. That said, there is more we can do here. Outstanding changes that may also help include:

https://review.openstack.org/#/c/447119/
https://review.openstack.org/#/c/450207/

But we also really need individual projects to look at the memory consumption of OpenStack itself and trim where they are able.

Unfortunately, the Libvirt crashes continue to be a problem.

Current top issues:

1. Libvirt crashes:

http://status.openstack.org/elastic-recheck/#1643911
http://status.openstack.org/elastic-recheck/#1646779

Libvirt is randomly crashing during jobs, which (for obvious reasons) causes them to fail. Addressing this will likely require someone with experience debugging libvirt, since it is most likely a bug isolated to libvirt itself. We're looking for someone familiar with libvirt internals to drive the effort to fix this issue.

2. Network packet loss in OSIC:

This has caused connectivity errors to external services. Various e-r bugs appear to have tripped on this, including:

http://status.openstack.org/elastic-recheck/index.html#1282876
http://status.openstack.org/elastic-recheck/index.html#1674681
http://status.openstack.org/elastic-recheck/index.html#1669162
http://status.openstack.org/elastic-recheck/index.html#1326813
We expect that the problem has been corrected, but we should keep an eye on these and make sure they fall off the e-r list.

Also, our classification rate has taken a nosedive lately: http://status.openstack.org/elastic-recheck/data/integrated_gate.html

Something that would help is if people start classifying these failures. While the overall failure rate is lower than in previous weeks, a low classification rate means there are race conditions (or other failures) we're not yet tracking, which will only make them harder to fix. Normally, a classification rate below 90% means we have at least one big persistent failure condition we're not aware of yet.

Thank you,

mtreinish and clarkb

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev