On Fri, Jun 2, 2017 at 4:42 PM, Ben Nemec <[email protected]> wrote:
> On 03/28/2017 05:01 PM, Ben Nemec wrote:
>> Final (hopefully) update:
>>
>> All active compute nodes have been rebooted and things seem to be stable
>> again.  Jobs are even running a little faster, so I'm thinking this had
>> a detrimental effect on performance too.  I've set a reminder for about
>> two months from now to reboot again if we're still using this environment.
>
> The reminder popped up this week, and I've rebooted all the compute nodes
> again.  It went pretty smoothly so I doubt anyone noticed that it happened
> (except that I forgot to restart the zuul-status webapp), but if you run
> across any problems let me know.

Thanks Ben! http://zuul-status.tripleo.org/ is awesome, I missed it.

>> On 03/24/2017 12:48 PM, Ben Nemec wrote:
>>> To follow-up on this, we've continued to hit this issue on other compute
>>> nodes.  Not surprising, of course.  They've all been up for about the
>>> same period of time and have had largely even workloads.
>>>
>>> It has caused problems though because it is cropping up faster than I
>>> can respond (it takes a few hours to cycle all the instances off a
>>> compute node, and I need to sleep sometime :-), so I've started
>>> pre-emptively rebooting compute nodes to get ahead of it.  Hopefully
>>> I'll be able to get all of the potentially broken nodes at least
>>> disabled by the end of the day so we'll have another 3 months before we
>>> have to worry about this again.
>>>
>>> On 03/24/2017 11:47 AM, Derek Higgins wrote:
>>>> On 22 March 2017 at 22:36, Ben Nemec <[email protected]> wrote:
>>>>> Hi all (owl?),
>>>>>
>>>>> You may have missed it in all the ci excitement the past couple of
>>>>> days, but we had a partial outage of rh1 last night.  It turns out the
>>>>> OVS port issue Derek discussed in
>>>>> http://lists.openstack.org/pipermail/openstack-dev/2016-December/109182.html
>>>>> reared its ugly head on a few of our compute nodes, which caused them
>>>>> to be unable to spawn new instances.  They kept getting scheduled
>>>>> since it looked like they were underutilized, which caused most of our
>>>>> testenvs to fail.
>>>>>
>>>>> I've rebooted the affected nodes, as well as a few more that looked
>>>>> like they might run into the same problem in the near future.
>>>>> Everything looks to be working well again since sometime this morning
>>>>> (when I disabled the broken compute nodes), but there aren't many jobs
>>>>> passing due to the plethora of other issues we're hitting in ci.
>>>>> There have been some stable job passes though so I believe things are
>>>>> working again.
>>>>>
>>>>> As far as preventing this in the future, the right thing to do would
>>>>> probably be to move to a later release of OpenStack (either point or
>>>>> major) where hopefully this problem would be fixed.  However, I'm
>>>>> hesitant to do that for a few reasons.  First is "the devil you know".
>>>>> Outside of this issue, we've gotten rh1 pretty rock solid lately.
>>>>> It's been overworked, but has been cranking away for months with no
>>>>> major cloud-related outages.  Second is that an upgrade would be a
>>>>> major process, probably involving some amount of downtime.  Since the
>>>>> long-term plan is to move everything to RDO cloud I'm not sure that's
>>>>> the best use of our time at this point.
>>>>
>>>> +1 on keeping the status quo until moving to rdo-cloud.
>>>>> Instead, my plan for the near term is to keep a closer eye on the
>>>>> error notifications from the services.  We previously haven't had
>>>>> anything consuming those, but I've dropped a little tool on the
>>>>> controller that will dump out error notifications so we can watch for
>>>>> signs of this happening again.  I suspect the signs were there long
>>>>> before the actual breakage happened, but nobody was looking for them.
>>>>> Now I will be.
>>>>>
>>>>> So that's where things stand with rh1.  Any comments or concerns
>>>>> welcome.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -Ben
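
Side note for whoever ends up watching notifications on the next environment:
a dumper like the one described above is fairly small to put together with
oslo.messaging. Below is a rough, untested sketch (not Ben's actual tool); the
transport URL and topic are placeholders and would need to match the
notification settings in the services' config files, and the services only
emit these events if a messaging notification driver is enabled.

    # Rough sketch of an error-notification dumper using oslo.messaging.
    # The transport URL and topic below are placeholders, not rh1 values.
    import sys

    from oslo_config import cfg
    import oslo_messaging


    class ErrorEndpoint(object):
        # oslo.messaging dispatches notifications to endpoint methods named
        # after the priority, so defining only error() means only *.error
        # messages land here.
        def error(self, ctxt, publisher_id, event_type, payload, metadata):
            print('%s %s %s' % (metadata.get('timestamp'),
                                publisher_id, event_type))
            print('    %s' % payload)
            sys.stdout.flush()


    def main():
        transport = oslo_messaging.get_notification_transport(
            cfg.CONF, url='rabbit://guest:guest@localhost:5672/')  # placeholder
        targets = [oslo_messaging.Target(topic='notifications')]
        listener = oslo_messaging.get_notification_listener(
            transport, targets, [ErrorEndpoint()], executor='threading')
        listener.start()
        listener.wait()


    if __name__ == '__main__':
        main()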
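And since pre-emptively pulling nodes out of rotation came up earlier in the
thread: for anyone doing that by hand, something along these lines with
python-novaclient marks the compute service disabled so the scheduler stops
placing new instances on it while the node waits for its reboot. Again just a
sketch; the auth URL, credentials, and host name are made up.

    # Rough sketch: disable nova-compute on a broken node so the scheduler
    # stops picking it.  Auth details and host name are placeholders.
    from keystoneauth1 import loading
    from keystoneauth1 import session
    from novaclient import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='http://controller:5000/v3',   # placeholder
        username='admin', password='secret',    # placeholders
        project_name='admin',
        user_domain_name='Default',
        project_domain_name='Default')
    sess = session.Session(auth=auth)
    nova = client.Client('2.1', session=sess)

    nova.services.disable_log_reason(
        'compute-12.localdomain',               # placeholder host name
        'nova-compute',
        'OVS port issue, pending reboot')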
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
