Team,

As many of you are aware, we had downtime for a few hours this morning. I wanted to document the what, when, and why so that everyone understands what the issue was, how it was handled, and what has been and is being done to avoid it in the future. For those who were involved, if I've missed anything, please feel free to append to the thread so that we get all the details accurate.
Over the last 24 hours, each of our Red Hat servers received kernel updates from RHN. To complete these updates, each box needed to be rebooted. We discussed (in IRC) which boxes should be rebooted first based on security concerns. window and container were at the top of that list, as those boxes have the most direct console interaction with users. window was rebooted first and came back up promptly at about 7:15 am (MST). container was rebooted second at about 7:45 am (MST) and did *not* come back up until 10:00 am (MST). This required manual intervention by Owen, who tried console and KVM access and eventually had to get data-center staff to manually reboot the machine.

The downtime was not limited to container itself, as container hosts the home folders for all machines and exports them via NFS. With container down, many other services were affected and no one was able to log in to any other server. Known affected services were git, www.gnome.org (why?), and mail (why?). bugzilla and the wiki were unaffected during the outage.

The reason container didn't come back up was an SELinux issue. Upon rebooting, the system halted with the error: "Unable to load SELinux Policy. Machine is in enforcing mode. Halting now." It is unknown why SELinux was active on container when it is disabled on all other hosts. SELinux has now been disabled on container by appending "selinux=0" to the kernel line in grub, as well as by setting "SELINUX=disabled" in /etc/sysconfig/selinux (the exact lines are quoted in the P.S. below). It has also been verified that all other hosts have SELinux disabled, so, again, it is a mystery why it was enabled on container.

This downtime brought two issues to the forefront that I think need to be addressed:

1) We need new hardware! Other than the server that Jeff donated, just about everything is out of warranty. We are seriously asking for trouble running critical services on machines that are no longer supported. It is only a matter of time until one of these boxes goes down for good.

2) We need out-of-band access to the hardware. The solution to today's problem required Owen. Unless I am mistaken, he is the only admin with console/KVM access to any of the hardware, and the only one able to file tickets with Red Hat IT. I propose that out-of-band access to our servers be configured and granted to more admins. This would relieve Owen of being the single point of contact between the team and Red Hat IT, let us connect to the hardware via more reliable methods, and let us respond more quickly to issues such as this one. Would we have had a solution to today's problem if Owen had been unavailable?

I know Paul has been working on a proposal for replacement hardware, and I think that should be a priority. If anyone can give him additional reasons to present to the board, please comment here or contact him directly. If we plan to move forward and really solidify the infrastructure, we need new, supported hardware.

I do want to thank Owen for being so prompt in attending to this issue and for communicating with the team via regular status reports. I'm glad we were able to resolve the issues relatively quickly. I hope everyone can look at this situation not as a failure of service, but as critical lessons learned toward improving our infrastructure and shoring up problems.

As usual, if you have any questions or comments about today's downtime, please feel free to contact me.

Christer
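P.S. For reference, here is roughly what the change on container looks like. This is a sketch assuming the standard Red Hat file locations; the kernel version and root device below are placeholders, not the actual values from container:

    # /boot/grub/grub.conf -- "selinux=0" appended to the kernel line:
    kernel /vmlinuz-2.6.18-x.el5 ro root=/dev/VolGroup00/LogVol00 selinux=0

    # /etc/sysconfig/selinux:
    SELINUX=disabled

With these in place, running "getenforce" on the box should print "Disabled"; that is also a quick way to spot-check the other hosts. If I've gotten any of this wrong, please correct me on the thread.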
