Short version: The RH1 CI region has been down since yesterday afternoon.
We have a misbehaving switch and have filed a support ticket with the vendor to troubleshoot further. We hope to know more this weekend, or Monday at the latest.

Long version:

Yesterday afternoon we started seeing issues scheduling jobs on the RH1 CI cloud. We haven't made any OpenStack configuration changes recently, and things have been quite stable for some time now (our uptime was 365 days on the controller).

Initially we found a misconfigured Keystone URL which was preventing some diagnostic queries via OS clients external to the rack. This setting hadn't been changed recently, however, and didn't seem to bother nodepool before, so I don't think it is the cause of the outage.

MySQL also got a bounce. It seemed happy enough after the restart as well.

After fixing the Keystone setting and bouncing MySQL, instances appeared to go ACTIVE, but we were still having connectivity issues getting floating IPs and DHCP working on overcloud instances.

After a good bit of debugging we started looking at the switches. It turns out one of them has high CPU usage (above the warning threshold), and its MAC addresses are also unstable (ports are moving around).

Until this is resolved, RH1 is unavailable to host CI jobs. I will post back here with an update once we have more information.

Dan

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev