Kudos for providing a good incident report and explanation of the cause of the 
problem.

I know little about this actual incident, but that is not important. It is
really good to see constructive responses to problems like this, with a focus
on improvement and on preventing problems in the future.

Dave

On Thu, Sep 21, 2017 at 03:35:19PM +0200, Christophe Demarey wrote:
> July 28, 2017 Incident Report
> 
> This summer, we experienced a serious incident on the Continuous Integration 
> service infrastructure. Today we're providing an incident report that 
> details the nature of the incident and our response.
> We understand this service issue has impacted all Inria developers using the 
> CI service and their valued time, and we apologize to everyone who was 
> affected.
> 
> Issue summary
> Due to an operator error, many virtual machines (VMs) hosted on 
> CloudStack were destroyed and could not be recovered.
> Jenkins servers were not affected by this issue, so all Jenkins jobs and 
> history are safe. 
> 
> Root cause
> After the successful migration of the Jenkins servers and the security fixes 
> (July 18-19, 2017), a few projects were unable to reach their slaves hosted 
> on the CI build farm. The cause was a synchronization problem between the CI 
> database and CloudStack (which powers the CI build farm and has its own way 
> of managing projects and users, through domains).
> An attempt to reproduce and debug this problem on the qualification 
> infrastructure failed, so we added some logging on the production 
> infrastructure. To limit the risk to the production infrastructure, we 
> restricted the synchronization to one project. That was the mistake.
> The synchronization of one project led to the deletion of all other 
> CloudStack domains (i.e. projects): the synchronization code expects to 
> receive the full environment (all CI projects), and if it finds a domain not 
> bound to a CI project, it deletes it.
> The synchronization process was aborted before it completed, but it was too 
> late. Some user virtual machines stayed alive for a few hours but were 
> eventually purged by CloudStack. 
> As a result, we lost most of the VMs and templates hosted on the CI build farm.
> 
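The failure mode described under "Root cause" can be illustrated with a small
sketch. This is not the CI team's actual synchronization code; the function
name and the sample domain names are hypothetical, and the only assumption is
the behaviour stated in the report: any domain not bound to a CI project in the
input is treated as an orphan and deleted.

```python
def domains_to_delete(ci_projects, cloud_domains):
    """Return the domains a naive sync would treat as orphans.

    Hypothetical helper: a domain counts as an orphan when no project
    in the list handed to the sync is bound to it.
    """
    known = set(ci_projects)
    return sorted(d for d in cloud_domains if d not in known)


cloud_domains = ["alpha", "beta", "gamma"]

# Full environment: every domain is bound to a project, nothing is removed.
print(domains_to_delete(["alpha", "beta", "gamma"], cloud_domains))

# Sync restricted to one project: every other domain now looks like an orphan.
print(domains_to_delete(["alpha"], cloud_domains))
```

With the full project list the result is empty; with only one project, every
other domain is selected for deletion, which matches what the report describes.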
> Resolution and recovery
> It was impossible to recover the destroyed virtual machines. CloudStack is 
> configured to keep VM data for 24 hours before actually destroying it, but 
> this does not apply when the domain hosting the VM is destroyed.
> The primary storage hosting running VMs is high-performance and very 
> expensive storage. That's why the CI team chose (when the CI service was set 
> up) not to back up VMs but rather to rely on both the expunge delay and the 
> snapshot / template mechanism to save VM state. This mechanism was of no use 
> in this incident.
> Templates and snapshots are hosted on the secondary storage, which is 
> redundant across two different buildings to ensure data reliability and 
> recovery. The incident led CloudStack to perform a "clean" deletion of 
> all the domain data, including templates. That's why they also became 
> unavailable.
> 
> We were able to rebuild all domains from the CI database, but CI service 
> users had to create new VMs to replace the destroyed ones.
> 
> Corrective and preventative measures
> All members of the CI team (DSI, SED) have worked, and are still working, 
> together to find the best way to mitigate the incident and prevent similar 
> situations in the future.
> The synchronization code responsible for the deletion of CloudStack domains 
> has been deactivated. CloudStack domain deletion is a critical action and 
> will no longer be automated. Deletions will be reviewed and approved by the 
> CI team before being completed.
> This incident showed us that the backup mechanisms in place are not strong 
> enough, and we are now evaluating the cost of backing up, with history, either:
>   - all VMs, snapshots and templates, or
>   - all snapshots and templates.
> We are also working on providing a way to download templates created on 
> CloudStack so that you can easily get a copy of them. We encourage you to 
> create templates for virtual machines that are time consuming to set up from 
> scratch.
> 
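The first corrective measure above (no automated domain deletion, with review
and approval by the CI team) could look roughly like the sketch below. It is
only an illustration under stated assumptions, not the team's implementation:
the function names, the `is_full_environment` flag and the sample domains are
all hypothetical. The idea is that the sync only proposes deletions, proposes
nothing when it was given a partial project list, and leaves the final decision
to an explicit approval step.

```python
def plan_domain_deletions(ci_projects, cloud_domains, is_full_environment):
    """Propose orphan domains for deletion; nothing is deleted here."""
    if not is_full_environment:
        # A partial project list says nothing about the other domains.
        return []
    known = set(ci_projects)
    return sorted(d for d in cloud_domains if d not in known)


def confirm_deletions(candidates, approved_by_team):
    """Keep only the candidates that a reviewer explicitly approved."""
    approved = set(approved_by_team)
    return [d for d in candidates if d in approved]


cloud_domains = ["alpha", "beta", "gamma"]

# Restricted run (one project): no deletion is proposed at all.
print(plan_domain_deletions(["alpha"], cloud_domains, is_full_environment=False))

# Full run: "gamma" has no CI project, so it is proposed, not deleted.
candidates = plan_domain_deletions(
    ["alpha", "beta"], cloud_domains, is_full_environment=True
)
print(candidates)

# Without an approval, nothing goes forward.
print(confirm_deletions(candidates, approved_by_team=[]))
```

Separating "plan" from "confirm" keeps the destructive step out of the
automated path, which is the property the report asks for.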
> 
> Sincerely,
> The CI Team
