[google-appengine] Re: Postmortem for August 18, 2011 outage

Robert Kluin Mon, 29 Aug 2011 13:34:42 -0700

------
Robert Kluin
Ezox Systems, LLC

On Mon, Aug 29, 2011 at 16:34, Robert Kluin <[email protected]> wrote:
> Hi Ikai,
>  Thanks for the information; it is greatly appreciated.
>
>  So just to be clear, had high-replication applications been serving
> from that data-center, they still would have experienced some
> downtime, but the downtime would have been considerably less?  For
> high-replication applications, is the decision to switch datacenters
> purely automated or does a human need to evaluate the situation and
> make the decision?
>
> Robert
>
> On Mon, Aug 29, 2011 at 16:08, Ikai Lan <[email protected]> wrote:
>> Postmortem
>>
>> This document details the cause of App Engine's outage on August 18th,
>> 2011, which affected applications running the Master/Slave Datastore, and
>> the events that immediately followed it.
>>
>> Summary
>>
>> On August 18th, 2011, a Google data center in the American Midwest, which
>> was serving App Engine Master/Slave Datastore applications on that date,
>> lost utility power as a result of an intense thunderstorm. Power
>> distribution equipment in the datacenter failed in the wake of the loss of
>> utility power, which powered off a subset of the machines in the datacenter.
>>
>> The power loss to the affected machines both reduced the available computing
>> capacity in the data center, and took offline parts of the storage
>> infrastructure, causing Master/Slave Datastore applications to experience
>> high latency, serve errors, or be completely unavailable. When Google’s data
>> center operations team reported that it would be several hours before they
>> would be able to restore power to the affected machines due to the ongoing
>> thunderstorm, the App Engine team decided to perform an emergency failover
>> from the serving data center to the backup data center for Master/Slave
>> Datastore applications.
>>
>> High Replication Datastore applications were not serving from this data
>> center on that date, and were therefore unaffected by this outage.
>>
>> Background
>>
>> During data center outages affecting Master/Slave Datastore applications,
>> the App Engine team’s only options are to weather the outage in the current
>> location, providing a degraded experience to our customers, or to perform an
>> emergency maintenance to relocate Master/Slave Datastore applications to the
>> backup datacenter. Emergency maintenance procedures do not allow time to
>> fully replicate between the current serving data center for Master/Slave
>> Datastore applications, and the backup data center, because the storage
>> system in the current serving data center is severely degraded or offline.
>> Thus, data written recently to the current Master/Slave Datastore data
>> center is temporarily stranded, and the Master/Slave Datastore appears to
>> jump backwards in time when it returns to service.
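
To make the "jump backwards in time" concrete, here is a minimal Python sketch of asynchronous replication followed by an emergency failover. It is a toy model under stated assumptions, not App Engine code: the class MasterSlaveStore and all of its methods are hypothetical names invented for illustration, and the real replication machinery is far more involved.

```python
class MasterSlaveStore:
    """Toy model: writes land on the primary and reach the backup later."""

    def __init__(self):
        self.primary = {}
        self.backup = {}
        self.pending = []            # acknowledged writes not yet replicated
        self.serving = self.primary  # data center currently answering requests

    def put(self, key, value):
        self.serving[key] = value
        if self.serving is self.primary:
            # Replication is asynchronous: the write is only queued here.
            self.pending.append((key, value))

    def get(self, key):
        return self.serving.get(key)

    def replicate(self):
        """Copy queued writes to the backup (runs some time after put)."""
        for key, value in self.pending:
            self.backup[key] = value
        self.pending.clear()

    def emergency_failover(self):
        """Primary is unreachable: serve from the backup as it stands."""
        stranded = dict(self.pending)  # writes left behind on the primary
        self.pending.clear()
        self.serving = self.backup
        return stranded


store = MasterSlaveStore()
store.put("counter", 1)
store.replicate()          # this write reached the backup
store.put("counter", 2)    # this one did not, the outage hits first
stranded = store.emergency_failover()

print(store.get("counter"))  # 1: the application sees the older value
print(stranded)              # {'counter': 2}: data to audit and hand back
```

The stranded dictionary stands in for the unreplicated data that has to be audited and returned to application owners after the failover.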
>>
>> The App Engine team does not make the decision to perform an emergency
>> relocation without replication lightly. It is extremely disruptive for an
>> App Engine Master/Slave Datastore application to begin serving without the
>> most recent data that it committed to the Datastore. After the emergency
>> maintenance, the App Engine team must
>> audit and repair the Master/Slave Datastore, to determine the scope of the
>> unreplicated data stranded in the affected data center, and provide the
>> unreplicated data to application owners, so they may choose to re-integrate
>> it into their application’s Datastore, if they so desire. This is clearly an
>> undesirable outcome for both the App Engine team and the application owner.
>>
>> As such, App Engine's policy during power outages or severe disruptions
>> affecting the Master/Slave Datastore is to weather the outage in place for
>> up to an hour, unless the team is informed that the return to service will
>> definitely not begin within that hour. This policy was defined after
>> examining the historical record of outages and return to service times in
>> Google data centers, and assessing the likelihood of a quick return to
>> service versus the adverse effects of performing an emergency maintenance.
>> Google’s data center operations team is highly competent at returning
>> datacenters to service quickly and safely, and their abilities are leveraged
>> to the benefit of App Engine customers.
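
Read as a decision rule, the policy above can be sketched in a few lines of Python. This is only an illustration of the rule as stated: the function name, its parameters, and the choice to fail over exactly at the 60 minute mark are assumptions, and the real decision is made by people weighing more than two inputs.

```python
WAIT_LIMIT_MINUTES = 60  # "up to an hour", taken from the stated policy


def should_fail_over(minutes_since_outage, recovery_ruled_out):
    """Return True when the stated policy calls for an emergency failover.

    recovery_ruled_out is True once the team has been told that the return
    to service will definitely not begin within the hour (hypothetical flag).
    """
    if recovery_ruled_out:
        return True
    # Otherwise weather the outage in place until the wait limit is reached.
    return minutes_since_outage >= WAIT_LIMIT_MINUTES


# Twenty minutes in, with repairs still expected to start: stay put.
print(should_fail_over(20, recovery_ruled_out=False))  # False
# Operations reports that repairs cannot start: fail over.
print(should_fail_over(75, recovery_ruled_out=True))   # True
```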
>>
>> During this outage, the impact of the adverse weather conditions continued
>> for much longer than the App Engine team had anticipated, and made it
>> impossible for the data center operations team to safely begin the repair
>> process until the storm ended. As there was no estimated time for the data
>> center to return to service at that point, the App Engine team elected to
>> perform an emergency maintenance to switch Master/Slave Datastore
>> applications to their backup data center, returning them to service with
>> some amount of unreplicated data.
>>
>> Remediation
>>
>> The architecture of the Master/Slave Datastore for App Engine leaves no room
>> for substantial improvement in this situation. The Master/Slave Datastore
>> serves out of a single primary data center, with asynchronous, delayed
>> replication to a backup data center, and is always vulnerable to
>> unexpected outages in its primary data center.
>>
>> The normal maintenance procedure to switch Master/Slave Datastore
>> applications from serving to backup data center requires an hour of
>> read-only time to complete. While it would be possible to pre-emptively
>> perform a normal maintenance procedure to switch from serving to backup data
>> center for Master/Slave Datastore applications when adverse weather
>> conditions are expected, the majority of the time, adverse weather does not
>> result in a service outage. Implementing this policy would result in a far
>> greater amount of Master/Slave Datastore read-only periods without a
>> guarantee of reduced unplanned outages. This policy would provide no
>> protection against outages that occur without sufficient forewarning, e.g.
>> fire or loss of network connectivity.
>>
>> Recommendations
>>
>> The High Replication Datastore for App Engine applications is specifically
>> engineered to be resilient in the face of sudden outages affecting one or
>> more data centers. Data written to the High Replication Datastore is
>> synchronously replicated to multiple datacenters before App Engine indicates
>> success to your application.
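
For contrast with asynchronous replication, here is a minimal Python sketch of a write that is acknowledged only after several replicas have stored it. It is not the High Replication Datastore's actual protocol: Replica, replicated_put, and the required_acks threshold are hypothetical, and the postmortem only says "multiple datacenters" without giving a number.

```python
class Replica:
    """Toy stand-in for one data center's copy of the data."""

    def __init__(self, name, online=True):
        self.name = name
        self.online = online
        self.data = {}

    def store(self, key, value):
        if not self.online:
            return False  # an offline data center cannot acknowledge
        self.data[key] = value
        return True


def replicated_put(replicas, key, value, required_acks):
    """Report success only once enough replicas have stored the write.

    A failed call may still leave the value on some replicas; a real
    system needs a protocol to resolve that, which this sketch omits.
    """
    acks = sum(1 for replica in replicas if replica.store(key, value))
    return acks >= required_acks


replicas = [Replica("dc-a"), Replica("dc-b", online=False), Replica("dc-c")]
ok = replicated_put(replicas, "counter", 2, required_acks=2)
print(ok)  # True: two data centers hold the write before success is reported
```

Because the write already lives in more than one place when the application is told it succeeded, losing a single data center strands nothing.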
>>
>> Had High Replication Datastore applications been serving out of this data
>> center, it’s entirely possible they would have experienced minimal or no
>> degradation or outage. Additionally, the App Engine team could have ceased
>> all serving from that data center within minutes, without any temporary
>> stranding of data, or other adverse events.
>>
>> The Google App Engine team encourages all App Engine customers to migrate
>> their applications from the Master/Slave Datastore to the High Replication
>> Datastore. The High Replication Datastore is now the default for new App
>> Engine applications, the SLA available under the upcoming new pricing model
>> applies exclusively to High Replication Datastore applications, and we are
>> testing improved migration tools with early adopters now. (You can sign up
>> to be an early adopter at this link: http://goo.gl/3jrXu ) In addition, new
>> App Engine features, e.g. Go and Python 2.7, will be available exclusively to
>> High Replication Datastore applications.
>>
>> Timeline (all times US/Pacific)
>>
>> 5:35pm: Google data center loses computing and storage capability, as a
>> result of loss of utility power due to a severe thunderstorm in the area.
>> Google data center operations team begins responding to outage, in contact
>> with the App Engine team.
>> 6:50pm: Google data center operations reports data center will not return to
>> service promptly. App Engine team begins emergency maintenance to switch
>> Master/Slave Datastore applications to backup data center.
>> 7:10pm: appengine-downtime-notify forum is notified:
>> https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/_yTJse1eOaI
>> 7:20pm: Master/Slave Datastore applications begin serving in read-only mode
>> during the emergency maintenance.
>> 7:50pm: App Engine team completes emergency maintenance, and Master/Slave
>> Datastore applications are serving normally again.
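
The durations implied by the timeline can be worked out with a short Python sketch. The timestamps come straight from the entries above; the helper names are hypothetical, and treating 7:20pm to 7:50pm as the read-only window is my reading of the last two entries.

```python
from datetime import datetime


def at(clock):
    """Parse a timeline entry such as '5:35pm' on the day of the outage."""
    return datetime.strptime("2011-08-18 " + clock, "%Y-%m-%d %I:%M%p")


def minutes_between(start, end):
    return int((at(end) - at(start)).total_seconds() // 60)


POWER_LOST = "5:35pm"
FAILOVER_STARTED = "6:50pm"
READ_ONLY_STARTED = "7:20pm"
SERVING_NORMALLY = "7:50pm"

time_to_failover_start = minutes_between(POWER_LOST, FAILOVER_STARTED)
read_only_period = minutes_between(READ_ONLY_STARTED, SERVING_NORMALLY)
total_disruption = minutes_between(POWER_LOST, SERVING_NORMALLY)

print(time_to_failover_start)  # 75 minutes before the failover began
print(read_only_period)        # 30 minutes of read-only serving
print(total_disruption)        # 135 minutes from power loss to normal serving
```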
>>
>> -- Ikai Lan, on behalf of the App Engine team
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Google App Engine Downtime Notify" group.
>> To post to this group, send email to
>> [email protected].
>> To unsubscribe from this group, send email to
>> [email protected].
>> For more options, visit this group at
>> http://groups.google.com/group/google-appengine-downtime-notify?hl=en.
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
