Hello everyone,

As you might have noticed, we had a major issue in the GNOME infrastructure
last night, which made almost every service we provide unavailable.
This was caused by our main file server no longer serving the file systems
required for home directories and mailing lists.

The cause of the outage is currently unclear, as the logs do not show
anything relevant.
We've sent them to the Gluster engineers to ask for their help in analyzing them.

When we rebooted the server, something went wrong, and the affected machine
needed a power cycle.
When we tried this, we hit a bug in the management cards that prevented us
from using them to reboot the server.

Because of this, we requested hands-on service to get the server power
cycled, which kept us waiting for some time.
Within minutes of the server being rebooted, the file systems came back
online, and with them all of the GNOME services.

To prevent all services from going down if the primary file server goes
down, we had previously set up a synchronized secondary file server.
We were unable to make all servers fall back to it because we could not
log in to the affected servers to update the target IP.

To prevent this problem from taking down the entire GNOME infrastructure in
the future, we have taken some steps:
    - We have added a way for us to log in to any server even if the home
      directories are down.
    - We'll be introducing automatic failover to the other available file
      server.
    - We'll be spreading our documentation off-site so that the relevant
      documentation does not become unavailable when the machine hosting it
      is experiencing problems.
    - We will be making sure we have access to the power management for our
      servers, so we can reboot them even if the management cards are not
      functioning.

We really hope that this will prevent such drastic failures in the future, and 
make it easier to recover if problems do occur.

If you have any additional questions, don't hesitate to contact either of us on 
IRC (#sysadmin) or by sending us an email.

With kind regards,
Patrick Uiterwijk and Andrea Veri
System Administrators, GNOME
_______________________________________________
foundation-list mailing list
foundation-list@gnome.org
https://mail.gnome.org/mailman/listinfo/foundation-list
