Kudos for providing a good incident report and explanation of the cause of the problem.
I know little about this actual incident, but that is not important. It is
really good to see constructive responses to problems like this with a focus
on improvement and problem prevention in the future.

Dave

On Thu, Sep 21, 2017 at 03:35:19PM +0200, Christophe Demarey wrote:
> July 28, 2017 Incident Report
>
> This summer, we experienced a hard incident on the Continuous Integration
> service infrastructure. Today we're providing an incident report that
> details the nature of the incident and our response.
> We understand this service issue has impacted all Inria developers using the
> CI service and their valued time, and we apologize to everyone who was
> affected.
>
> Issue summary
> Due to a manipulation error, a lot of Virtual Machines (VM) hosted on
> CloudStack were destroyed and not recoverable!
> Jenkins servers are not concerned by this issue, so all Jenkins jobs and
> history are safe.
>
> Root cause
> After the successful migration of Jenkins servers and security fixes (July
> 18-19, 2017), a few projects were not able to reach their slaves hosted on
> the CI build farm. This problem was due to a synchronization problem between
> the CI database and CloudStack (powering the CI build farm) having its own
> way to manage projects and users (through domains).
> An attempt to reproduce and debug this problem on the qualification
> infrastructure failed. So, we added some logging on the production
> infrastructure. To avoid troubles to the production infrastructure, we
> limited the synchronization to one project. It was the mistake!
> The synchronization of one project led to the deletion of all other
> CloudStack domains (i.e. projects). Indeed, the synchronization code expected
> to get the full environment (all CI projects) and if it finds a domain not
> bound to a CI project, it deletes it...
> The synchronization process was aborted before its termination but it was too
> late.
> Some user Virtual Machines were still alive during some hours but were
> finally purged by CloudStack.
> It means we lost most VM and templates hosted on the CI build farm.
>
> Resolution and recovery
> It was impossible to recover destroyed virtual machines. CloudStack is
> configured to keep VM data 24 hours before actually destroying it but it does
> not work when the domain hosting the VM is destroyed.
> Primary storage hosting running VM is a high-performance and very expensive
> storage. That's why the CI team chose (at the CI service setup) to do not
> backup VM but rather to rely on both the expunge delay and the snapshot /
> template mechanism to save VM state. This mechanism was useless in relation
> to this incident.
> Templates and snapshots are hosted on the secondary storage that is a
> redundant storage in two different buildings to ensure data reliability and
> recovery. The incident led CloudStack to perform a "clean" deletion of
> all the domain data including templates. That's why they also became
> unavailable.
>
> We were able to rebuild all domains from the CI database but CI service users
> had to create new VM to replace the destroyed ones.
>
> Corrective and preventative measures
> All members of the CI team (DSI, SED) worked and are still working all
> together to find the best solution to mitigate the incident and prevent same
> situations in the future.
> The synchronization code responsible of the deletion of CloudStack domain has
> been deactivated. CloudStack domain deletion is a critical action and will no
> longer be automated. Deletions will be reviewed and approved by the CI team
> before being completed.
> This incident showed us that backup mechanism in place are not strong enough
> and we are now evaluating the cost to backup, with history:
> all VM, snapshots and templates or
> all snapshots and templates.
> We are also working on providing a way to download templates created on
> CloudStack so that you can easily get a copy of them. We encourage you to
> create templates for virtual machines that are time consuming to set up from
> scratch.
>
>
> Sincerely,
> The CI Team
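
For anyone who wants to see why limiting the synchronization to one project
was so destructive, here is a rough sketch of the failure mode as the report
describes it. This is not the CI team's code and it does not call CloudStack;
the function names, project names and the guard in the second function are
all hypothetical, chosen only to illustrate the reasoning.

```python
from __future__ import annotations


def domains_to_delete(ci_projects: set[str], domains: set[str]) -> set[str]:
    """Behaviour described in the report: any domain not bound to one of the
    CI projects passed in is treated as orphaned and marked for deletion."""
    return domains - ci_projects


def propose_deletions(
    synced_projects: set[str], all_projects: set[str], domains: set[str]
) -> list[str]:
    """Hypothetical safer variant: refuse to look for orphans when the input
    is not the full project list, and only return candidates for a human to
    review instead of deleting anything."""
    if synced_projects != all_projects:
        raise ValueError("partial project list: orphan detection skipped")
    return sorted(domains - all_projects)


if __name__ == "__main__":
    # Made-up data: three CI projects and one domain with no project.
    all_projects = {"alpha", "beta", "gamma"}
    domains = {"alpha", "beta", "gamma", "stale"}

    # Limiting the run to one project makes every other domain look orphaned.
    print(sorted(domains_to_delete({"alpha"}, domains)))

    # With the full list, only the genuinely unbound domain is proposed.
    print(propose_deletions(all_projects, all_projects, domains))
```

The second function mirrors the corrective measure in the report (deletions
reviewed and approved rather than automated): it produces a list for review
and rejects a partial run outright, rather than guessing which domains are
safe to remove.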
