On Fri, Jun 2, 2017 at 4:42 PM, Ben Nemec <[email protected]> wrote:
> On 03/28/2017 05:01 PM, Ben Nemec wrote:
>> Final (hopefully) update:
>>
>> All active compute nodes have been rebooted and things seem to be stable
>> again.  Jobs are even running a little faster, so I'm thinking this had
>> a detrimental effect on performance too.  I've set a reminder for about
>> two months from now to reboot again if we're still using this environment.
>
> The reminder popped up this week, and I've rebooted all the compute nodes
> again.  It went pretty smoothly so I doubt anyone noticed that it happened
> (except that I forgot to restart the zuul-status webapp), but if you run
> across any problems let me know.

Thanks Ben! http://zuul-status.tripleo.org/ is awesome, I missed it.

>> On 03/24/2017 12:48 PM, Ben Nemec wrote:
>>> To follow-up on this, we've continued to hit this issue on other compute
>>> nodes.  Not surprising, of course.  They've all been up for about the
>>> same period of time and have had largely even workloads.
>>>
>>> It has caused problems though because it is cropping up faster than I
>>> can respond (it takes a few hours to cycle all the instances off a
>>> compute node, and I need to sleep sometime :-), so I've started
>>> pre-emptively rebooting compute nodes to get ahead of it.  Hopefully
>>> I'll be able to get all of the potentially broken nodes at least
>>> disabled by the end of the day so we'll have another 3 months before we
>>> have to worry about this again.
>>>
>>> On 03/24/2017 11:47 AM, Derek Higgins wrote:
>>>> On 22 March 2017 at 22:36, Ben Nemec <[email protected]> wrote:
>>>>> Hi all (owl?),
>>>>>
>>>>> You may have missed it in all the ci excitement the past couple of
>>>>> days, but we had a partial outage of rh1 last night.  It turns out the
>>>>> OVS port issue Derek discussed in
>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>> reared its ugly head on a few of our compute nodes, which caused them
>>>>> to be unable to spawn new instances.  They kept getting scheduled
>>>>> since it looked like they were underutilized, which caused most of our
>>>>> testenvs to fail.
>>>>>
>>>>> I've rebooted the affected nodes, as well as a few more that looked
>>>>> like they might run into the same problem in the near future.
>>>>> Everything looks to be working well again since sometime this morning
>>>>> (when I disabled the broken compute nodes), but there aren't many jobs
>>>>> passing due to the plethora of other issues we're hitting in ci.
>>>>> There have been some stable job passes though so I believe things are
>>>>> working again.
>>>>>
>>>>> As far as preventing this in the future, the right thing to do would
>>>>> probably be to move to a later release of OpenStack (either point or
>>>>> major) where hopefully this problem would be fixed.  However, I'm
>>>>> hesitant to do that for a few reasons.  First is "the devil you know".
>>>>> Outside of this issue, we've gotten rh1 pretty rock solid lately.
>>>>> It's been overworked, but has been cranking away for months with no
>>>>> major cloud-related outages.  Second is that an upgrade would be a
>>>>> major process, probably involving some amount of downtime.  Since the
>>>>> long-term plan is to move everything to RDO cloud I'm not sure that's
>>>>> the best use of our time at this point.
>>>>
>>>> +1 on keeping the status quo until moving to rdo-cloud.
>>>>> Instead, my plan for the near term is to keep a closer eye on the
>>>>> error notifications from the services.  We previously haven't had
>>>>> anything consuming those, but I've dropped a little tool on the
>>>>> controller that will dump out error notifications so we can watch for
>>>>> signs of this happening again.  I suspect the signs were there long
>>>>> before the actual breakage happened, but nobody was looking for them.
>>>>> Now I will be.
>>>>>
>>>>> So that's where things stand with rh1.  Any comments or concerns
>>>>> welcome.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -Ben
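
Side note for whoever ends up watching notifications on the next environment:
a dumper like the one described above is fairly small to put together with
oslo.messaging. Below is a rough, untested sketch (not Ben's actual tool); the
transport URL and topic are placeholders and would need to match the
notification settings in the services' config files, and the services only
emit these events if a messaging notification driver is enabled.

    # Rough sketch of an error-notification dumper using oslo.messaging.
    # The transport URL and topic below are placeholders, not rh1 values.
    import sys

    from oslo_config import cfg
    import oslo_messaging


    class ErrorEndpoint(object):
        # oslo.messaging dispatches notifications to endpoint methods named
        # after the priority, so defining only error() means only *.error
        # messages land here.
        def error(self, ctxt, publisher_id, event_type, payload, metadata):
            print('%s %s %s' % (metadata.get('timestamp'),
                                publisher_id, event_type))
            print('    %s' % payload)
            sys.stdout.flush()


    def main():
        transport = oslo_messaging.get_notification_transport(
            cfg.CONF, url='rabbit://guest:guest@localhost:5672/')  # placeholder
        targets = [oslo_messaging.Target(topic='notifications')]
        listener = oslo_messaging.get_notification_listener(
            transport, targets, [ErrorEndpoint()], executor='threading')
        listener.start()
        listener.wait()


    if __name__ == '__main__':
        main()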
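And since pre-emptively pulling nodes out of rotation came up earlier in the
thread: for anyone doing that by hand, something along these lines with
python-novaclient marks the compute service disabled so the scheduler stops
placing new instances on it while the node waits for its reboot. Again just a
sketch; the auth URL, credentials, and host name are made up.

    # Rough sketch: disable nova-compute on a broken node so the scheduler
    # stops picking it.  Auth details and host name are placeholders.
    from keystoneauth1 import loading
    from keystoneauth1 import session
    from novaclient import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://controller:5000/v3',   # placeholder
        username='admin', password='secret',    # placeholders
        project_name='admin',
        user_domain_name='Default',
        project_domain_name='Default')
    sess = session.Session(auth=auth)
    nova = client.Client('2.1', session=sess)

    nova.services.disable_log_reason(
        'compute-12.localdomain',               # placeholder host name
        'nova-compute',
        'OVS port issue, pending reboot')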
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
