On 22 March 2017 at 22:36, Ben Nemec <openst...@nemebean.com> wrote:
> Hi all (owl?),
>
> You may have missed it in all the ci excitement the past couple of days, but
> we had a partial outage of rh1 last night. It turns out the OVS port issue
> Derek discussed in
> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
> reared its ugly head on a few of our compute nodes, which caused them to be
> unable to spawn new instances. They kept getting scheduled since it looked
> like they were underutilized, which caused most of our testenvs to fail.
>
> I've rebooted the affected nodes, as well as a few more that looked like
> they might run into the same problem in the near future. Everything looks
> to be working well again since sometime this morning (when I disabled the
> broken compute nodes), but there aren't many jobs passing due to the
> plethora of other issues we're hitting in ci. There have been some stable
> job passes though so I believe things are working again.
>
> As far as preventing this in the future, the right thing to do would
> probably be to move to a later release of OpenStack (either point or major)
> where hopefully this problem would be fixed. However, I'm hesitant to do
> that for a few reasons. First is "the devil you know". Outside of this
> issue, we've gotten rh1 pretty rock solid lately. It's been overworked, but
> has been cranking away for months with no major cloud-related outages.
> Second is that an upgrade would be a major process, probably involving some
> amount of downtime. Since the long-term plan is to move everything to RDO
> cloud I'm not sure that's the best use of our time at this point.

+1 on keeping the status quo until moving to rdo-cloud.

> Instead, my plan for the near term is to keep a closer eye on the error
> notifications from the services. We previously haven't had anything
> consuming those, but I've dropped a little tool on the controller that will
> dump out error notifications so we can watch for signs of this happening
> again. I suspect the signs were there long before the actual breakage
> happened, but nobody was looking for them. Now I will be.
>
> So that's where things stand with rh1. Any comments or concerns welcome.
>
> Thanks.
>
> -Ben
>
> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
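
For anyone wondering what a watcher like the one Ben mentions might look like, here is a minimal sketch. It is not Ben's tool, which isn't shown in this thread. It assumes the notifications have already been pulled off the message bus and decoded into dicts with the usual envelope fields (priority, publisher_id, event_type, timestamp, payload); the transport side is left out entirely. The function names and the sample data are made up for illustration.

```python
import json
from typing import Iterable, Iterator

# Priorities worth a human look. Assumption: the usual notification
# priority names, compared case-insensitively.
ALERT_PRIORITIES = {"ERROR", "CRITICAL"}


def error_notifications(notifications: Iterable[dict]) -> Iterator[dict]:
    """Yield only the notifications whose priority is ERROR or CRITICAL."""
    for note in notifications:
        if str(note.get("priority", "")).upper() in ALERT_PRIORITIES:
            yield note


def format_notification(note: dict) -> str:
    """Render one notification as a single log-friendly line."""
    return "{ts} {prio} {pub} {event} {payload}".format(
        ts=note.get("timestamp", "-"),
        prio=str(note.get("priority", "-")).upper(),
        pub=note.get("publisher_id", "-"),
        event=note.get("event_type", "-"),
        payload=json.dumps(note.get("payload", {}), sort_keys=True),
    )


if __name__ == "__main__":
    # Hypothetical sample data, not taken from rh1.
    sample = [
        {
            "priority": "INFO",
            "publisher_id": "compute.example-node-1",
            "event_type": "compute.instance.create.end",
            "timestamp": "2017-03-22 09:00:00",
            "payload": {},
        },
        {
            "priority": "ERROR",
            "publisher_id": "compute.example-node-2",
            "event_type": "compute.instance.create.error",
            "timestamp": "2017-03-22 09:00:05",
            "payload": {"message": "example failure text"},
        },
    ]
    for note in error_notifications(sample):
        print(format_notification(note))
```

Keeping the filter separate from whatever consumes the bus means the same check could be pointed at a log file or a queue. Grouping the output by publisher_id would probably make a single misbehaving compute node stand out sooner.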