Re: [openstack-dev] [tripleo] rh1 outage today

Ben Nemec Mon, 30 Oct 2017 15:16:05 -0700

It turns out this wasn't _quite_ resolved yet. I was still seeing some excessively long stack creation times today, and it turns out one of our compute nodes had virtualization turned off. This caused all of its instances to fail and need a retry. Once I disabled the compute service on it, stacks seemed to be creating in a normal amount of time again.

This happened because the node had some hardware issues, and apparently the fix was to replace the system board, so we got it back with everything set to default. I fixed this and re-enabled the node and all seems well again.


On 10/28/2017 02:07 AM, Juan Antonio Osorio wrote:

Thanks for the postmortem; it's always a good read to learn stuff :)
On 28 Oct 2017 00:11, "Ben Nemec" <[email protected]> wrote:
    Hi all,

    As you may or may not have noticed all ovb jobs on rh1 started
    failing sometime last night.  After some investigation today I found
    a few issues.

    First, our nova db archiving wasn't working.  This was due to the
    auto-increment counter issue described by melwitt in
    
http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html
<http://lists.openstack.org/pipermail/openstack-dev/2017-September/122903.html>Deleting the problematic rows from the shadow table got us past that.
    On another db-related note, we seem to have turned ceilometer back
    on at some point in rh1.  I think that was intentional to avoid
    notification queues backing up, but it led to a different problem.
    We had approximately 400 GB of mongodb data from ceilometer that we
    don't actually care about.  I cleaned that up and set a TTL in
    ceilometer so hopefully this won't happen again.
Is there an alarm or something we could set to get notified about this kind of stuff? Or better yet, something we could automate to avoid this? What's using mongodb nowadays?

Setting a TTL should avoid this in the future. Note that I don't think mongo is still used by default, but in our old Mitaka version it was.

For the nova archiving thing I think we'd have to set up email notifications for failed cron jobs. That would be a good RFE.
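
As a rough illustration of that RFE, here is a minimal Python sketch of a wrapper that a cron entry could call instead of invoking the archive command directly. It runs a command and, on a non-zero exit, builds a notification email containing the tail of the output. The function names, the 4000-character cutoff, and the idea of wrapping the job this way are all assumptions for illustration; nothing like this exists on rh1 as far as this thread says, and actually sending the message (for example with smtplib) is left to the caller.

```python
import subprocess
from email.message import EmailMessage


def run_job(command):
    """Run a command (list of strings) and return (exit_code, combined_output)."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold stderr into stdout so the mail shows both
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def build_failure_mail(command, exit_code, output, sender, recipient):
    """Return an EmailMessage describing a failed job, or None if it succeeded."""
    if exit_code == 0:
        return None
    msg = EmailMessage()
    msg["Subject"] = "cron job failed (exit %d): %s" % (exit_code, " ".join(command))
    msg["From"] = sender
    msg["To"] = recipient
    # Hypothetical cutoff: keep only the tail of the output so the mail stays readable.
    msg.set_content(output[-4000:] or "(no output)")
    return msg
```

The cron entry would then call the wrapper with the archive command as its argument, and the wrapper would hand any non-None message to whatever mail transport is available on the host.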



    Unfortunately neither of these things completely resolved the
    extreme slowness in the cloud that was causing every testenv to
    fail.  After trying a number of things that made no difference, the
    culprit seems to have been rabbitmq.  There was nothing obviously
    wrong with it according to the web interface, the queues were all
    short and messages seemed to be getting delivered.  However, when I
    ran rabbitmqctl status at the CLI it reported that the node was
    down.  Since something was clearly wrong I went ahead and restarted
    it.  After that everything seems to be back to normal.

Same question as above: could we set an alarm or automate the node recovery?

On this one I have no idea. As I noted, when I looked at the rabbit web ui everything looked fine. This isn't like the notification queue problem where one look at the queue lengths made it obvious something was wrong. Messages were being delivered successfully, just very, very slowly. Maybe looking at messages per second would help, but that would be hard to automate. You'd have to know if there were few messages going through because of performance issues or if the cloud is just under light load.
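
One possible way around the light-load ambiguity, sketched here purely as an assumption and not as anything that was tried on rh1, is to measure delivery latency rather than throughput: send a single canary message and time how long it takes to come back. A slow round trip points at the broker regardless of how busy the cloud is. In the Python sketch below, `send` and `receive` are hypothetical caller-supplied hooks that would need to be wired to a real messaging client, and the 5 second "slow" threshold is an arbitrary placeholder.

```python
import queue
import time


def probe_round_trip(send, receive, timeout=30.0, clock=time.monotonic):
    """Send one canary message and return the seconds until it comes back.

    send(payload) publishes the canary; receive(timeout) should block for up
    to `timeout` seconds and return a payload, or None if nothing arrived.
    Returns None when the canary did not come back before the deadline.
    """
    payload = "canary-%r" % clock()
    start = clock()
    send(payload)
    deadline = start + timeout
    while True:
        remaining = deadline - clock()
        if remaining <= 0:
            return None
        if receive(remaining) == payload:
            return clock() - start
        # Otherwise: a stale canary or a receive timeout; retry until the deadline.


def classify(latency, slow_after=5.0):
    """Map a probe result to 'ok', 'slow', or 'down' (threshold is a placeholder)."""
    if latency is None:
        return "down"
    return "slow" if latency > slow_after else "ok"


def loopback_hooks():
    """In-memory stand-in for a broker, for trying the probe without one."""
    q = queue.Queue()

    def receive(timeout):
        try:
            return q.get(timeout=timeout)
        except queue.Empty:
            return None

    return q.put, receive
```

With the loopback hooks, `classify(probe_round_trip(*loopback_hooks()))` exercises the whole path without a broker; against a real one, the "slow" result is the case that the web ui did not make visible here.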

I guess it's also worth noting that at some point this cloud is going away in favor of RDO cloud. Of course we said that back in December when we discussed the OVS port exhaustion issue and now 11 months later it still hasn't happened. That's why I haven't been too inclined to pursue extensive monitoring for the existing cloud though.



    I'm not sure exactly what the cause of all this was.  We did get
    kind of inundated with jobs yesterday after a zuul restart which I
    think is what probably pushed us over the edge, but that has
    happened before without bringing the cloud down.  It was probably a
    combination of some previously unnoticed issues stacking up over
    time and the large number of testenvs requested all at once.

    In any case, testenvs are creating successfully again and the jobs
    in the queue look good so far.  If you notice any problems please
    let me know though.  I'm hoping this will help with the job
    timeouts, but that remains to be seen.

    -Ben

    __________________________________________________________________________
    OpenStack Development Mailing List (not for usage questions)
    Unsubscribe: [email protected]?subject:unsubscribe
    http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev



