Hello folks,

We had a brief outage yesterday that misc and I worked on fixing. We're committed to doing a formal post-mortem of every outage, whether it affects everyone or not, as a habit. Here's the post-mortem of yesterday's event.
## Affected Servers

* salt-master.rax.gluster.org
* syslog01.rax.gluster.org

## Total Duration

~4 hours

## What Happened

A few Rackspace servers depend on DHCP (the default Rackspace setup). As part of the CentOS 7.4 upgrade, we rebooted some servers, since the kernel and other packages had been upgraded. At this point, we're unsure whether this is a DHCP bug, an upgrade gone wrong, or a fault on the Rackspace DHCP servers. We will be looking into this in the coming days.

Michael had issues with the Rackspace console, so Nigel stepped in to help with the outage. Once we accessed the machine via the Emergency Console, we spent some time trying to get a DHCP lease. When that didn't work, we started setting up a static IP and gateway. This took a few tries, since the Rackspace documentation for doing this was wrong. There's also a slight difference between "ip" and "ifconfig", which created further confusion. This is what we eventually ran on one of the servers:

    ip address add 162.209.109.18/24 dev eth0
    route add default gw 162.209.109.1

This incident did not affect any of our critical services. Gerrit, Jenkins, and download.gluster.org remained unaffected during this period. We were, however, limited in our ability to roll out changes via Ansible to these servers during this ~4h window. We have a second server in progress for deploying infrastructure, but its setup is not ready yet. Manual roll-out from a sysadmin's laptop was always possible in case of trouble.

## Timeline of Events

Note: All times in CEST.

* 09:00 am: Nigel and Michael plan a new HTTP server inside the cage for logs, packages, and Coverity scans.
* 10:00 am: Michael starts the Ansible process to install the new server.
* 12:10 pm: The topic of the CentOS 7.4 upgrade comes up during discussion, and Michael does an upgrade and reboot on salt-master.rax.gluster.org.
* 12:15 pm: Michael notices that the salt-master server is not coming back. Nigel confirms.
* 12:15 pm: Nigel logs into Rackspace and does a hard restart on the salt-master machine. No luck.
* 12:34 pm: Nigel logs a ticket with Rackspace about the server outage.
* 12:44 pm: Nigel starts a chat conversation with Rackspace support for escalation. A customer support engineer informs us that the server is up and can be accessed via the Emergency Console.
* 12:57 pm: Nigel gains access via the Emergency Console. Michael's initial RCA of the issue is a network problem caused by the upgrade. Nigel confirms the RCA by verifying that eth0 does not have a public IP, then tries to get the IP address to stick with the right gateway.
* 12:35 pm: Nigel manages to get salt-master online briefly.
* 13:34 pm: Nigel brings the salt-master back online.
* 13:40 pm: Michael tries to upgrade the syslog server and reboots it; it does not come back up either.
* 13:55 pm: Nigel brings the syslog server back online as well.

## Pending Actions

* Michael to figure out whether there is a bug in the new DHCP daemon, or whether things changed on the Rackspace side.
* Michael to finish the move of salt-master into the cage (ant-queen.int.rht.gluster.org) to prevent further issues.
* Nigel to send a note to Rackspace support to fix their documentation.

--
Nigel and Michael
Gluster Infra Team
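PS: A note on the fix above, for anyone hitting the same problem. The `route add default gw ...` command comes from the legacy net-tools package; the pure-iproute2 equivalent is `ip route add default via 162.209.109.1 dev eth0`. Either way, addresses added with `ip address add` do not survive a reboot. On CentOS 7, a persistent static configuration would look roughly like the sketch below — this is an illustrative example reusing the address and gateway from this incident, not the actual config we deployed, and the interface name is an assumption:

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0
# Sketch of a persistent static config (illustrative values).
DEVICE=eth0
BOOTPROTO=none      # disable DHCP on this interface
ONBOOT=yes          # bring the interface up at boot
IPADDR=162.209.109.18
PREFIX=24
GATEWAY=162.209.109.1
```

After editing the file, `systemctl restart network` applies the change.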
_______________________________________________
Gluster-infra mailing list
Gluster-infra@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-infra