------ Robert Kluin Ezox Systems, LLC
On Mon, Aug 29, 2011 at 16:34, Robert Kluin <[email protected]> wrote:
> Hi Ikai,
> Thanks for the information; it is greatly appreciated.
>
> So just to be clear, had high-replication applications been serving
> from that data-center, they still would have experienced some
> downtime, but the downtime would have been considerably less? For
> high-replication applications, is the decision to switch datacenters
> purely automated or does a human need to evaluate the situation and
> make the decision?
>
> Robert
>
> On Mon, Aug 29, 2011 at 16:08, Ikai Lan <[email protected]> wrote:
>> Postmortem
>>
>> This document details the cause and events occurring immediately after App
>> Engine's outage on August 18th, 2011 impacting applications running the
>> Master/Slave datastore.
>>
>> Summary
>>
>> On August 18th, 2011, a Google data center in the American Midwest, which
>> was serving App Engine Master/Slave Datastore applications on that date,
>> lost utility power as a result of an intense thunderstorm. Power
>> distribution equipment in the datacenter failed in the wake of the loss of
>> utility power, which powered off a subset of the machines in the datacenter.
>>
>> The power loss to the affected machines both reduced the available computing
>> capacity in the data center, and took offline parts of the storage
>> infrastructure, causing Master/Slave Datastore applications to experience
>> high latency, serve errors, or be completely unavailable. When Google’s data
>> center operations team reported that it would be several hours before they
>> would be able to restore power to the affected machines due to the ongoing
>> thunderstorm, the App Engine team decided to perform an emergency failover
>> from the serving data center to the backup data center for Master/Slave
>> Datastore applications.
>>
>> High Replication Datastore applications were not serving from this data
>> center on that date, and were therefore unaffected by this outage.
>>
>> Background
>>
>> During data center outages affecting Master/Slave Datastore applications,
>> the App Engine team’s only options are to weather the outage in the current
>> location, providing a degraded experience to our customers, or to perform an
>> emergency maintenance to relocate Master/Slave Datastore applications to the
>> backup datacenter. Emergency maintenance procedures do not allow time to
>> fully replicate between the current serving data center for Master/Slave
>> Datastore applications, and the backup data center, because the storage
>> system in the current serving data center is severely degraded or offline.
>> Thus, data written recently to the current Master/Slave Datastore data
>> center is temporarily stranded, and the Master/Slave Datastore appears to
>> jump backwards in time when it returns to service.
>>
>> The App Engine team does not make the decision to perform an emergency
>> relocation without replication lightly. It is extremely disruptive to
>> applications to begin serving your App Engine Master/Slave Datastore
>> application without the most recent data that your application committed to
>> the Datastore. After the emergency maintenance, the App Engine team must
>> audit and repair the Master/Slave Datastore, to determine the scope of the
>> unreplicated data stranded in the affected data center, and provide the
>> unreplicated data to application owners, so they may choose to re-integrate
>> it into their application’s Datastore, if they so desire. This is clearly an
>> undesirable outcome for both the App Engine team and the application owner.
>>
>> As such, App Engine’s policies during power outages or severe disruptions
>> affecting the Master/Slave Datastore, are to weather the outage in place for
>> up to an hour, unless the team is informed that the return to service will
>> definitely not begin within that hour.
>> This policy was defined after
>> examining the historical record of outages and return to service times in
>> Google data centers, and assessing the likelihood of a quick return to
>> service versus the adverse effects of performing an emergency maintenance.
>> Google’s data center operations team is highly competent at returning
>> datacenters to service quickly and safely, and their abilities are leveraged
>> to the benefit of App Engine customers.
>>
>> During this outage, the impact of the adverse weather conditions continued
>> for much longer than the App Engine team had anticipated, and made it
>> impossible for the data center operations team to safely begin the repair
>> process until the storm ended. As there was no estimated time for the data
>> center to return to service at that point, the App Engine team elected to
>> perform an emergency maintenance to switch Master/Slave Datastore
>> applications to their backup data center, returning them to service with
>> some amount of unreplicated data.
>>
>> Remediation
>>
>> The architecture of the Master/Slave Datastore for App Engine makes no
>> substantial improvement in this situation possible. The Master/Slave
>> Datastore serves out of a single primary data center, with asynchronous
>> delayed replication to a backup data center, and is always vulnerable to
>> unexpected outages in its primary data center.
>>
>> The normal maintenance procedure to switch Master/Slave Datastore
>> applications from serving to backup data center requires an hour of
>> read-only time to complete. While it would be possible to pre-emptively
>> perform a normal maintenance procedure to switch from serving to backup data
>> center for Master/Slave Datastore applications when adverse weather
>> conditions are expected, the majority of the time, adverse weather does not
>> result in a service outage.
>> Implementing this policy would result in a far
>> greater amount of Master/Slave Datastore read-only periods without a
>> guarantee of reduced unplanned outages. This policy would provide no
>> protection against outages that occur without sufficient forewarning, e.g.
>> fire or loss of network connectivity.
>>
>> Recommendations
>>
>> The High Replication Datastore for App Engine applications is specifically
>> engineered to be resilient in the face of sudden outages affecting one or
>> more data centers. Data written to the High Replication Datastore is
>> synchronously replicated to multiple datacenters before App Engine indicates
>> success to your application.
>>
>> Had High Replication Datastore applications been serving out of this data
>> center, it’s entirely possible they would have experienced minimal or no
>> degradation or outage. Additionally, the App Engine team could have ceased
>> all serving from that data center within minutes, without any temporary
>> stranding of data, or other adverse events.
>>
>> The Google App Engine team encourages all App Engine customers to migrate
>> their applications from the Master/Slave Datastore to the High Replication
>> Datastore. The High Replication Datastore is now the default for new App
>> Engine applications, the SLA available under the upcoming new pricing model
>> applies exclusively to High Replication Datastore applications, and we are
>> testing improved migration tools with early adopters now. (You can sign up
>> to be an early adopter at this link: http://goo.gl/3jrXu ) In addition, new
>> App Engine features, e.g. Go and Python 2.7, will be available exclusively to
>> High Replication Datastore applications.
>>
>> Timeline (all times US/Pacific)
>>
>> 5:35pm: Google data center loses computing and storage capability, as a
>> result of loss of utility power due to severe thunderstorm in the area.
>> Google data center operations team begins responding to outage, in contact
>> with the App Engine team.
>> 6:50pm: Google data center operations reports data center will not return to
>> service promptly. App Engine team begins emergency maintenance to switch
>> Master/Slave Datastore applications to backup data center.
>> 7:10pm: appengine-downtime-notify forum is notified:
>> https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/_yTJse1eOaI
>> 7:20pm: Master/Slave Datastore applications begin serving in read-only mode
>> during the emergency maintenance.
>> 7:50pm: App Engine team completes emergency maintenance, and Master/Slave
>> Datastore applications are serving normally again.
>>
>> -- Ikai Lan, on behalf of the App Engine team
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Google App Engine Downtime Notify" group.
>> To post to this group, send email to
>> [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/google-appengine-downtime-notify?hl=en.
>>
>

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
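
For readers skimming the thread, the difference the postmortem draws between asynchronous replication (Master/Slave) and synchronous replication (High Replication) can be sketched with a toy model. This is only an illustration and not App Engine code: the class names `MasterSlaveStore` and `SyncReplicatedStore`, their methods, and the use of in-memory dicts are all hypothetical, and writing to every replica before acknowledging is a simplification of whatever the real datastore does.

```python
class MasterSlaveStore:
    """Hypothetical toy model: the primary acknowledges writes, the backup gets them later."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.pending = []  # writes acknowledged but not yet copied to the backup

    def put(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))
        return True  # acknowledged before the backup has the data

    def replicate(self):
        """Delayed, asynchronous copy of pending writes to the backup."""
        for key, value in self.pending:
            self.backup[key] = value
        self.pending.clear()

    def emergency_failover(self):
        """Primary is unreachable: serve from the backup and report stranded writes."""
        stranded = dict(self.pending)
        self.pending.clear()
        self.primary, self.backup = self.backup, {}
        return stranded


class SyncReplicatedStore:
    """Hypothetical toy model: a write is acknowledged only once all replicas have it."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]

    def put(self, key, value):
        for replica in self.replicas:
            replica[key] = value
        return True  # acknowledged after replication (simplified to "all replicas")

    def drop_replica(self, index):
        """Stop serving from one location, e.g. a data center that lost power."""
        del self.replicas[index]

    def get(self, key):
        return self.replicas[0].get(key)
```

In this sketch, a write made to `MasterSlaveStore` after the last `replicate()` call is returned by `emergency_failover()` as stranded data and is missing from the store that resumes serving, which is the "jump backwards in time" described above. Dropping a replica from `SyncReplicatedStore` loses nothing, because every acknowledged write already exists elsewhere.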
