[google-appengine] Re: Postmortem for August 18, 2011 outage

Robert Kluin Mon, 29 Aug 2011 13:35:22 -0700

Hi Ikai,
 Thanks for the information; it is greatly appreciated.

 So just to be clear, had high-replication applications been serving
from that data-center, they still would have experienced some
downtime, but the downtime would have been considerably less?  For
high-replication applications, is the decision to switch datacenters
purely automated or does a human need to evaluate the situation and
make the decision?

Robert


On Mon, Aug 29, 2011 at 16:08, Ikai Lan <[email protected]> wrote:
> Postmortem
>
> This document details the cause of, and the events immediately following,
> App Engine's outage on August 18th, 2011, which impacted applications
> running the Master/Slave Datastore.
>
> Summary
>
> On August 18th, 2011, a Google data center in the American Midwest, which
> was serving App Engine Master/Slave Datastore applications on that date,
> lost utility power as a result of an intense thunderstorm. Power
> distribution equipment in the datacenter failed in the wake of the loss of
> utility power, which powered off a subset of the machines in the datacenter.
>
> The power loss to the affected machines both reduced the available computing
> capacity in the data center, and took offline parts of the storage
> infrastructure, causing Master/Slave Datastore applications to experience
> high latency, serve errors, or be completely unavailable. When Google’s data
> center operations team reported that it would be several hours before they
> would be able to restore power to the affected machines due to the ongoing
> thunderstorm, the App Engine team decided to perform an emergency failover
> from the serving data center to the backup data center for Master/Slave
> Datastore applications.
>
> High Replication Datastore applications were not serving from this data
> center on that date, and were therefore unaffected by this outage.
>
> Background
>
> During data center outages affecting Master/Slave Datastore applications,
> the App Engine team’s only options are to weather the outage in the current
> location, providing a degraded experience to our customers, or to perform an
> emergency maintenance to relocate Master/Slave Datastore applications to the
> backup datacenter. Emergency maintenance procedures do not allow time to
> fully replicate data from the current serving data center for Master/Slave
> Datastore applications to the backup data center, because the storage
> system in the current serving data center is severely degraded or offline.
> Thus, data written recently to the current Master/Slave Datastore data
> center is temporarily stranded, and the Master/Slave Datastore appears to
> jump backwards in time when it returns to service.
>
> The App Engine team does not make the decision to perform an emergency
> relocation without replication lightly. It is extremely disruptive for an
> App Engine Master/Slave Datastore application to begin serving without the
> most recent data that it committed to the Datastore. After the emergency
> maintenance, the App Engine team must
> audit and repair the Master/Slave Datastore, to determine the scope of the
> unreplicated data stranded in the affected data center, and provide the
> unreplicated data to application owners, so they may choose to re-integrate
> it into their application’s Datastore, if they so desire. This is clearly an
> undesirable outcome for both the App Engine team and the application owner.
>
> As such, App Engine’s policy during power outages or severe disruptions
> affecting the Master/Slave Datastore is to weather the outage in place for
> up to an hour, unless the team is informed that the return to service will
> definitely not begin within that hour. This policy was defined after
> examining the historical record of outages and return-to-service times in
> Google data centers, and assessing the likelihood of a quick return to
> service versus the adverse effects of performing an emergency maintenance.
> Google’s data center operations team is highly competent at returning
> datacenters to service quickly and safely, and their abilities are leveraged
> to the benefit of App Engine customers.
>
> During this outage, the impact of the adverse weather conditions continued
> for much longer than the App Engine team had anticipated, and made it
> impossible for the data center operations team to safely begin the repair
> process until the storm ended. As there was no estimated time for the data
> center to return to service at that point, the App Engine team elected to
> perform an emergency maintenance to switch Master/Slave Datastore
> applications to their backup data center, returning them to service with
> some amount of unreplicated data.
>
> Remediation
>
> The architecture of the Master/Slave Datastore for App Engine makes no
> substantial improvement in this situation possible. The Master/Slave
> Datastore serves out of a single primary data center, with asynchronous
> delayed replication to a backup data center, and is always vulnerable to
> unexpected outages in its primary data center.
>
> The normal maintenance procedure to switch Master/Slave Datastore
> applications from serving to backup data center requires an hour of
> read-only time to complete. While it would be possible to pre-emptively
> perform a normal maintenance procedure to switch from serving to backup data
> center for Master/Slave Datastore applications when adverse weather
> conditions are expected, the majority of the time, adverse weather does not
> result in a service outage. Implementing this policy would result in a far
> greater amount of Master/Slave Datastore read-only periods without a
> guarantee of reduced unplanned outages. This policy would provide no
> protection against outages that occur without sufficient forewarning, e.g.
> fire or loss of network connectivity.
>
> Recommendations
>
> The High Replication Datastore for App Engine applications is specifically
> engineered to be resilient in the face of sudden outages affecting one or
> more data centers. Data written to the High Replication Datastore is
> synchronously replicated to multiple datacenters before App Engine indicates
> success to your application.
>
> Had High Replication Datastore applications been serving out of this data
> center, it’s entirely possible they would have experienced minimal or no
> degradation or outage. Additionally, the App Engine team could have ceased
> all serving from that data center within minutes, without any temporary
> stranding of data, or other adverse events.
>
> The Google App Engine team encourages all App Engine customers to migrate
> their applications from the Master/Slave Datastore to the High Replication
> Datastore. The High Replication Datastore is now the default for new App
> Engine applications, the SLA available under the upcoming new pricing model
> applies exclusively to High Replication Datastore applications, and we are
> testing improved migration tools with early adopters now. (You can sign up
> to be an early adopter at this link: http://goo.gl/3jrXu ) In addition, new
> App Engine features, e.g. Go and Python 2.7, will be available exclusively to
> High Replication Datastore applications.
>
> Timeline (all times US/Pacific)
>
> 5:35pm: Google data center loses computing and storage capability, as a
> result of loss of utility power due to a severe thunderstorm in the area.
> Google data center operations team begins responding to outage, in contact
> with the App Engine team.
> 6:50pm: Google data center operations reports data center will not return to
> service promptly. App Engine team begins emergency maintenance to switch
> Master/Slave Datastore applications to backup data center.
> 7:10pm: appengine-downtime-notify forum is notified:
> https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/_yTJse1eOaI
> 7:20pm: Master/Slave Datastore applications begin serving in read-only mode
> during the emergency maintenance.
> 7:50pm: App Engine team completes emergency maintenance, and Master/Slave
> Datastore applications are serving normally again.
>
> -- Ikai Lan, on behalf of the App Engine team
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine Downtime Notify" group.
> To post to this group, send email to
> [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine-downtime-notify?hl=en.
>
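
For anyone trying to follow the "weather in place for up to an hour" rule from the Background section above, here is a rough sketch of how I read it. This is only my interpretation of the stated policy, not anything Google has published: the function name, the parameters, and the way the operations team's estimate is represented are all assumptions made up for illustration.

```python
from datetime import timedelta
from typing import Optional

# From the postmortem: weather the outage in place "for up to an hour".
MAX_WAIT_IN_PLACE = timedelta(hours=1)


def should_fail_over(elapsed: timedelta,
                     eta_to_repair_start: Optional[timedelta]) -> bool:
    """Hypothetical reading of the Master/Slave failover policy.

    elapsed: time since the outage began.
    eta_to_repair_start: the operations team's estimate, measured from
        the start of the outage, of when the return to service will
        begin. None means no estimate has been given yet (assumption).
    """
    # The waiting window has run out without a recovery.
    if elapsed >= MAX_WAIT_IN_PLACE:
        return True
    # The team has been told that recovery will not begin within the hour.
    if eta_to_repair_start is not None and eta_to_repair_start > MAX_WAIT_IN_PLACE:
        return True
    # Otherwise keep weathering the outage in place.
    return False


# Timeline from the postmortem: outage at 5:35pm, failover begun at 6:50pm.
print(should_fail_over(timedelta(minutes=75), None))
```

By the posted timeline, 75 minutes had already passed when the emergency maintenance began, so under this reading the one-hour window had expired regardless of any estimate.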

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.
