Hi Ikai,

Thanks for the information; it is greatly appreciated. So just to be clear, had high-replication applications been serving from that data center, they would still have experienced some downtime, but considerably less? For high-replication applications, is the decision to switch data centers purely automated, or does a human need to evaluate the situation and make the decision?
Robert

On Mon, Aug 29, 2011 at 16:08, Ikai Lan <[email protected]> wrote:
> Postmortem
>
> This document details the cause and events occurring immediately after App
> Engine's outage on August 18th, 2011 impacting applications running the
> Master/Slave datastore.
>
> Summary
>
> On August 18th, 2011, a Google data center in the American Midwest, which
> was serving App Engine Master/Slave Datastore applications on that date,
> lost utility power as a result of an intense thunderstorm. Power
> distribution equipment in the datacenter failed in the wake of the loss of
> utility power, which powered off a subset of the machines in the datacenter.
>
> The power loss to the affected machines both reduced the available computing
> capacity in the data center, and took offline parts of the storage
> infrastructure, causing Master/Slave Datastore applications to experience
> high latency, serve errors, or be completely unavailable. When Google’s data
> center operations team reported that it would be several hours before they
> would be able to restore power to the affected machines due to the ongoing
> thunderstorm, the App Engine team decided to perform an emergency failover
> from the serving data center to the backup data center for Master/Slave
> Datastore applications.
>
> High Replication Datastore applications were not serving from this data
> center on that date, and were therefore unaffected by this outage.
>
> Background
>
> During data center outages affecting Master/Slave Datastore applications,
> the App Engine team’s only options are to weather the outage in the current
> location, providing a degraded experience to our customers, or to perform an
> emergency maintenance to relocate Master/Slave Datastore applications to the
> backup datacenter.
> Emergency maintenance procedures do not allow time to
> fully replicate between the current serving data center for Master/Slave
> Datastore applications, and the backup data center, because the storage
> system in the current serving data center is severely degraded or offline.
> Thus, data written recently to the current Master/Slave Datastore data
> center is temporarily stranded, and the Master/Slave Datastore appears to
> jump backwards in time when it returns to service.
>
> The App Engine team does not make the decision to perform an emergency
> relocation without replication lightly. It is extremely disruptive to
> applications to begin serving your App Engine Master/Slave Datastore
> application without the most recent data that your application committed to
> the Datastore. After the emergency maintenance, the App Engine team must
> audit and repair the Master/Slave Datastore, to determine the scope of the
> unreplicated data stranded in the affected data center, and provide the
> unreplicated data to application owners, so they may choose to re-integrate
> it into their application’s Datastore, if they so desire. This is clearly an
> undesirable outcome for both the App Engine team and the application owner.
>
> As such, App Engine’s policies during power outages or severe disruptions
> affecting the Master/Slave Datastore are to weather the outage in place for
> up to an hour, unless the team is informed that the return to service will
> definitely not begin within that hour. This policy was defined after
> examining the historical record of outages and return to service times in
> Google data centers, and assessing the likelihood of a quick return to
> service versus the adverse effects of performing an emergency maintenance.
> Google’s data center operations team is highly competent at returning
> datacenters to service quickly and safely, and their abilities are leveraged
> to the benefit of App Engine customers.
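To make the Background section above concrete, here is a minimal Python sketch of the two ideas it describes: the weather-in-place policy and the stranding of unreplicated writes during an emergency failover. It is only an illustration, not how App Engine is implemented. The names `should_fail_over`, `quick_return_ruled_out` and `AsyncPrimaryBackup` are hypothetical, and the only value taken from the postmortem is the one-hour limit. Treating the one-hour mark itself as a trigger is an assumption.

```python
from datetime import timedelta
from typing import List

# Taken from the postmortem: weather the outage in place for up to an hour.
WEATHER_IN_PLACE_LIMIT = timedelta(hours=1)


def should_fail_over(elapsed: timedelta, quick_return_ruled_out: bool) -> bool:
    """Hypothetical encoding of the stated policy.

    quick_return_ruled_out: operations has reported that the return to
    service will definitely not begin within the hour.
    """
    if quick_return_ruled_out:
        return True
    # Assumption: reaching the one-hour mark itself triggers the failover.
    return elapsed >= WEATHER_IN_PLACE_LIMIT


class AsyncPrimaryBackup:
    """Toy model: the primary acknowledges writes, the backup gets them later."""

    def __init__(self) -> None:
        self.primary: List[str] = []
        self.backup: List[str] = []

    def write(self, value: str) -> None:
        # The write is acknowledged before the backup has a copy.
        self.primary.append(value)

    def replicate(self) -> None:
        # Copy whatever the backup is still missing.
        self.backup.extend(self.primary[len(self.backup):])

    def emergency_failover(self) -> List[str]:
        """Serve from the backup as it stands; return the stranded writes."""
        stranded = self.primary[len(self.backup):]
        # The served data now "jumps backwards in time" to the backup's state.
        self.primary = list(self.backup)
        return stranded
```

In this toy model, writing "a", replicating, then writing "b" and calling `emergency_failover()` returns `["b"]` as stranded data, while the served data contains only "a". That mirrors the unreplicated data the postmortem says must be audited and handed back to application owners.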
>
> During this outage, the impact of the adverse weather conditions continued
> for much longer than the App Engine team had anticipated, and made it
> impossible for the data center operations team to safely begin the repair
> process until the storm ended. As there was no estimated time for the data
> center to return to service at that point, the App Engine team elected to
> perform an emergency maintenance to switch Master/Slave Datastore
> applications to their backup data center, returning them to service with
> some amount of unreplicated data.
>
> Remediation
>
> The architecture of the Master/Slave Datastore for App Engine makes no
> substantial improvement in this situation possible. The Master/Slave
> Datastore serves out of a single primary data center, with asynchronous
> delayed replication to a backup data center, and is always vulnerable to
> unexpected outages in its primary data center.
>
> The normal maintenance procedure to switch Master/Slave Datastore
> applications from serving to backup data center requires an hour of
> read-only time to complete. While it would be possible to pre-emptively
> perform a normal maintenance procedure to switch from serving to backup data
> center for Master/Slave Datastore applications when adverse weather
> conditions are expected, the majority of the time, adverse weather does not
> result in a service outage. Implementing this policy would result in a far
> greater number of Master/Slave Datastore read-only periods without a
> guarantee of reduced unplanned outages. This policy would provide no
> protection against outages that occur without sufficient forewarning, e.g.
> fire or loss of network connectivity.
>
> Recommendations
>
> The High Replication Datastore for App Engine applications is specifically
> engineered to be resilient in the face of sudden outages affecting one or
> more data centers.
> Data written to the High Replication Datastore is
> synchronously replicated to multiple datacenters before App Engine indicates
> success to your application.
>
> Had High Replication Datastore applications been serving out of this data
> center, it’s entirely possible they would have experienced minimal or no
> degradation or outage. Additionally, the App Engine team could have ceased
> all serving from that data center within minutes, without any temporary
> stranding of data, or other adverse events.
>
> The Google App Engine team encourages all App Engine customers to migrate
> their applications from the Master/Slave Datastore to the High Replication
> Datastore. The High Replication Datastore is now the default for new App
> Engine applications, the SLA available under the upcoming new pricing model
> applies exclusively to High Replication Datastore applications, and we are
> testing improved migration tools with early adopters now. (You can sign up
> to be an early adopter at this link: http://goo.gl/3jrXu ) In addition, new
> App Engine features, e.g. Go and Python 2.7, will be available exclusively to
> High Replication Datastore applications.
>
> Timeline (all times US/Pacific)
>
> 5:35pm: Google data center loses computing and storage capability, as a
> result of loss of utility power due to severe thunderstorm in the area.
> Google data center operations team begins responding to outage, in contact
> with the App Engine team.
> 6:50pm: Google data center operations reports data center will not return to
> service promptly. App Engine team begins emergency maintenance to switch
> Master/Slave Datastore applications to backup data center.
> 7:10pm: appengine-downtime-notify forum is notified:
> https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/_yTJse1eOaI
> 7:20pm: Master/Slave Datastore applications begin serving in read-only mode
> during the emergency maintenance.
> 7:50pm: App Engine team completes emergency maintenance, and Master/Slave
> Datastore applications are serving normally again.
>
> -- Ikai Lan, on behalf of the App Engine team

--
You received this message because you are subscribed to the Google Groups "Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/google-appengine?hl=en.
