It turns out this wasn't _quite_ resolved yet. I was still seeing some excessively long stack creation times today, and the cause was that one of our compute nodes had virtualization turned off. This caused all of its instances to fail and need a retry. Once I disabled the compute service on that node, stacks seemed to be creating in a normal amount of time again.

This happened because the node had some hardware issues, and apparently the fix was to replace the system board, so we got it back with everything in the firmware reset to defaults. I fixed the setting, re-enabled the node, and all seems well again.
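For future reference, a quick way to spot this condition on an x86 compute node is to count the hardware-virtualization CPU flags; zero means VT-x/AMD-V is disabled in the firmware:

```shell
# Count hardware-virtualization CPU flags on this host; a result of 0
# means VT-x/AMD-V is disabled in the BIOS/firmware, so kvm guests will
# fail to launch (or crawl under pure emulation).
grep -c -E 'vmx|svm' /proc/cpuinfo
```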

On 10/28/2017 02:07 AM, Juan Antonio Osorio wrote:
Thanks for the postmortem; it's always a good read to learn stuff :)

On 28 Oct 2017 00:11, "Ben Nemec" <> wrote:

    Hi all,

    As you may or may not have noticed all ovb jobs on rh1 started
    failing sometime last night.  After some investigation today I found
    a few issues.

    First, our nova db archiving wasn't working.  This was due to the
    auto-increment counter issue described by melwitt in <>.  Deleting
    the problematic rows from the shadow table got us past that.
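For reference, the cleanup amounted to deleting the colliding rows from the shadow table and re-running the archive. The table name and id below are hypothetical; the real ones come from the IntegrityError in the nova logs:

```shell
# Hypothetical cleanup: drop the shadow rows whose primary keys collide
# with the rows nova-manage is trying to archive.  The table name and
# id cutoff here are examples only -- take the real ones from the
# IntegrityError in the nova logs.
mysql nova -e "DELETE FROM shadow_instances WHERE id >= 100000;"

# Re-run the archive to confirm it makes progress again.
nova-manage db archive_deleted_rows --max_rows 10000
```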

    On another db-related note, we seem to have turned ceilometer back
    on at some point in rh1.  I think that was intentional to avoid
    notification queues backing up, but it led to a different problem.
    We had approximately 400 GB of mongodb data from ceilometer that we
    don't actually care about.  I cleaned that up and set a TTL in
    ceilometer so hopefully this won't happen again.
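The TTL itself is just a ceilometer.conf setting; in our old Mitaka release the metering TTL lives under [database]. A sketch, assuming a 3-day retention window:

```ini
# ceilometer.conf -- expire samples instead of keeping them forever.
# 259200 seconds = 3 days; pick whatever window covers debugging needs.
[database]
metering_time_to_live = 259200
event_time_to_live = 259200
```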

Is there an alarm or something we could set to get notified about this kind of stuff? Or better yet, something we could automate to avoid this? What's using mongodb nowadays?

Setting a TTL should avoid this in the future. Note that I don't think mongo is still used by default, but in our old Mitaka version it was.

For the nova archiving thing I think we'd have to set up email notifications for failed cron jobs. That would be a good RFE.
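Cron can actually do the notification part itself: if MAILTO is set, any output from a failing job gets mailed to that address. A sketch of what the archive cron entry could look like (the path, schedule, and address are made up):

```shell
# /etc/cron.d/nova-archive (hypothetical) -- cron mails any output a job
# produces to MAILTO, so a failing archive run generates an email on its
# own as long as the command writes its errors to stdout/stderr.
MAILTO=openstack-ops@example.com
0 3 * * * nova /usr/bin/nova-manage db archive_deleted_rows --max_rows 10000
```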

    Unfortunately neither of these things completely resolved the
    extreme slowness in the cloud that was causing every testenv to
    fail.  After trying a number of things that made no difference, the
    culprit seems to have been rabbitmq.  There was nothing obviously
    wrong with it according to the web interface, the queues were all
    short and messages seemed to be getting delivered.  However, when I
    ran rabbitmqctl status at the CLI it reported that the node was
    down.  Since something was clearly wrong I went ahead and restarted
    it.  After that everything seems to be back to normal.

Same question as above: could we set an alarm or automate the node recovery?

On this one I have no idea. As I noted, when I looked at the rabbit web UI everything looked fine. This isn't like the notification queue problem, where one look at the queue lengths made it obvious something was wrong. Messages were being delivered successfully, just very, very slowly. Maybe looking at messages per second would help, but that would be hard to automate. You'd have to know whether few messages were going through because of performance issues or because the cloud was just under light load.
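That said, the one clear signal we did have was that `rabbitmqctl status` exited non-zero at the CLI even while the web UI looked healthy, so a dumb watchdog keyed off the CLI exit code might have caught this case. A sketch, with the check and restart commands parameterized so the logic itself is testable; in production they'd be `rabbitmqctl status` and a service restart:

```shell
# Hypothetical watchdog: restart rabbit when the CLI status check fails.
# The check and restart commands are parameters here purely so the
# control flow can be exercised without a live broker; the real
# invocation would be:
#   check_and_restart "rabbitmqctl status" "systemctl restart rabbitmq-server"
check_and_restart() {
    status_cmd=$1
    restart_cmd=$2
    if $status_cmd >/dev/null 2>&1; then
        echo ok
    else
        # Status check failed: bounce the service and report it.
        $restart_cmd
        echo restarted
    fi
}
```

Run from cron every few minutes, this would have restarted the node hours earlier; the risk is a flapping restart loop if the status check fails for some reason other than the broker being wedged.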

I guess it's also worth noting that at some point this cloud is going away in favor of RDO cloud. Of course we said that back in December when we discussed the OVS port exhaustion issue and now 11 months later it still hasn't happened. That's why I haven't been too inclined to pursue extensive monitoring for the existing cloud though.

    I'm not sure exactly what the cause of all this was.  We did get
    kind of inundated with jobs yesterday after a zuul restart which I
    think is what probably pushed us over the edge, but that has
    happened before without bringing the cloud down.  It was probably a
    combination of some previously unnoticed issues stacking up over
    time and the large number of testenvs requested all at once.

    In any case, testenvs are creating successfully again and the jobs
    in the queue look good so far.  If you notice any problems please
    let me know though.  I'm hoping this will help with the job
    timeouts, but that remains to be seen.


    OpenStack Development Mailing List (not for usage questions)
