Re: [google-appengine] Re: Postmortem for August 18, 2011 outage

Ikai Lan (Google) Mon, 29 Aug 2011 15:36:09 -0700

Robert,

I'm going to try not to speculate too much, but our best guess is that high
replication applications would have experienced minor degradation. If you
have apps on high replication, there are periods where a local datastore may
be misbehaving, resulting in the serving applications making datastore calls
to a different datastore instance which may or may not be local to the
datacenter. A cross-datacenter call adds some latency to every datastore
operation, but to date, no high replication customers have reported error
periods as a result of local datastore spikes.
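
To make the fallback behavior above concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: `Replica`,
`ReplicaUnavailable`, and `read_with_fallback` are hypothetical names, not
App Engine APIs, and in practice this routing happens inside the datastore
service rather than in application code.

```python
class ReplicaUnavailable(Exception):
    """Raised by a (hypothetical) replica that cannot serve a request."""


class Replica:
    """Toy stand-in for one datastore instance; not a real App Engine class."""

    def __init__(self, name, data, healthy=True, local=True):
        self.name = name
        self.data = data
        self.healthy = healthy
        self.local = local  # True if in the same datacenter as the app

    def get(self, key):
        if not self.healthy:
            raise ReplicaUnavailable(self.name)
        return self.data[key]


def read_with_fallback(key, replicas):
    """Try local replicas first, then remote ones.

    Returns (value, name of the replica that served the read).
    """
    # False sorts before True, so local replicas come first.
    ordered = sorted(replicas, key=lambda r: not r.local)
    for replica in ordered:
        try:
            return replica.get(key), replica.name
        except ReplicaUnavailable:
            continue  # misbehaving instance: try the next one
    raise ReplicaUnavailable("no replica could serve the read")
```

With an unhealthy local replica and a healthy remote one, the read is
served by the remote replica: it still succeeds, just with the extra
cross-datacenter latency.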


A failover would have taken minutes; however, we would not have had to put
datastores into read-only mode, so in effect, there would not have been
downtime or periods of prolonged latency spikes.
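
The reason no read-only period is needed follows from the replication model
described in the quoted postmortem: Master/Slave acknowledges a write before
the backup has it, while High Replication acknowledges only after multiple
datacenters have it. A toy sketch of that difference, in Python (all class
names are hypothetical, and the "write to every datacenter" rule is a
simplification, since the exact rules are not described in this thread):

```python
class Datacenter:
    """Toy datacenter holding a key/value store (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.store = {}


class AsyncReplicated:
    """Master/Slave style: acknowledge after the primary write, copy later."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.pending = []  # writes not yet copied to the backup

    def put(self, key, value):
        self.primary.store[key] = value
        self.pending.append((key, value))
        return True  # acknowledged before the backup has the data

    def replicate(self):
        """Delayed replication step; may not run before an outage."""
        for key, value in self.pending:
            self.backup.store[key] = value
        self.pending.clear()


class SyncReplicated:
    """High Replication style: acknowledge only after all copies are written."""

    def __init__(self, datacenters):
        self.datacenters = datacenters

    def put(self, key, value):
        # Simplification: this toy model writes to every datacenter.
        for dc in self.datacenters:
            dc.store[key] = value
        return True
```

If the primary fails after a `put` but before `replicate` runs, the backup
in the asynchronous model still holds the older value, which is the
"stranded data" case from the postmortem. In the synchronous model any
remaining datacenter already has the acknowledged write, so serving can
move without a catch-up or read-only step.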

--
Ikai Lan
Developer Programs Engineer, Google App Engine
plus.ikailan.com | twitter.com/ikai



On Mon, Aug 29, 2011 at 1:35 PM, Robert Kluin <[email protected]> wrote:

> Hi Ikai,
>  Thanks for the information; it is greatly appreciated.
>
>  So just to be clear, had high-replication applications been serving
> from that data-center, they still would have experienced some
> downtime, but the downtime would have been considerably less?  For
> high-replication applications, is the decision to switch datacenters
> purely automated or does a human need to evaluate the situation and
> make the decision?
>
>
>
>
> Robert
>
>
> On Mon, Aug 29, 2011 at 16:08, Ikai Lan <[email protected]> wrote:
> > Postmortem
> >
> > This document details the cause and events occurring immediately after
> > App Engine's outage on August 18th, 2011 impacting applications running
> > the Master/Slave datastore.
> >
> > Summary
> >
> > On August 18th, 2011, a Google data center in the American Midwest, which
> > was serving App Engine Master/Slave Datastore applications on that date,
> > lost utility power as a result of an intense thunderstorm. Power
> > distribution equipment in the datacenter failed in the wake of the loss
> > of utility power, which powered off a subset of the machines in the
> > datacenter.
> >
> > The power loss to the affected machines both reduced the available
> > computing capacity in the data center, and took offline parts of the
> > storage infrastructure, causing Master/Slave Datastore applications to
> > experience high latency, serve errors, or be completely unavailable. When
> > Google’s data center operations team reported that it would be several
> > hours before they would be able to restore power to the affected machines
> > due to the ongoing thunderstorm, the App Engine team decided to perform
> > an emergency failover from the serving data center to the backup data
> > center for Master/Slave Datastore applications.
> >
> > High Replication Datastore applications were not serving from this data
> > center on that date, and were therefore unaffected by this outage.
> >
> > Background
> >
> > During data center outages affecting Master/Slave Datastore applications,
> > the App Engine team’s only options are to weather the outage in the
> > current location, providing a degraded experience to our customers, or to
> > perform an emergency maintenance to relocate Master/Slave Datastore
> > applications to the backup datacenter. Emergency maintenance procedures
> > do not allow time to fully replicate between the current serving data
> > center for Master/Slave Datastore applications, and the backup data
> > center, because the storage system in the current serving data center is
> > severely degraded or offline. Thus, data written recently to the current
> > Master/Slave Datastore data center is temporarily stranded, and the
> > Master/Slave Datastore appears to jump backwards in time when it returns
> > to service.
> >
> > The App Engine team does not make the decision to perform an emergency
> > relocation without replication lightly. It is extremely disruptive to
> > applications to begin serving your App Engine Master/Slave Datastore
> > application without the most recent data that your application committed
> > to the Datastore. After the emergency maintenance, the App Engine team
> > must audit and repair the Master/Slave Datastore, to determine the scope
> > of the unreplicated data stranded in the affected data center, and
> > provide the unreplicated data to application owners, so they may choose
> > to re-integrate it into their application’s Datastore, if they so desire.
> > This is clearly an undesirable outcome for both the App Engine team and
> > the application owner.
> >
> > As such, App Engine’s policy during power outages or severe disruptions
> > affecting the Master/Slave Datastore is to weather the outage in place
> > for up to an hour, unless the team is informed that the return to service
> > will definitely not begin within that hour. This policy was defined after
> > examining the historical record of outages and return-to-service times in
> > Google data centers, and assessing the likelihood of a quick return to
> > service versus the adverse effects of performing an emergency
> > maintenance. Google’s data center operations team is highly competent at
> > returning datacenters to service quickly and safely, and their abilities
> > are leveraged to the benefit of App Engine customers.
> >
> > During this outage, the impact of the adverse weather conditions
> > continued for much longer than the App Engine team had anticipated, and
> > made it impossible for the data center operations team to safely begin
> > the repair process until the storm ended. As there was no estimated time
> > for the data center to return to service at that point, the App Engine
> > team elected to perform an emergency maintenance to switch Master/Slave
> > Datastore applications to their backup data center, returning them to
> > service with some amount of unreplicated data.
> >
> > Remediation
> >
> > The architecture of the Master/Slave Datastore for App Engine makes no
> > substantial improvement in this situation possible. The Master/Slave
> > Datastore serves out of a single primary data center, with asynchronous
> > delayed replication to a backup data center, and is always vulnerable to
> > unexpected outages in its primary data center.
> >
> > The normal maintenance procedure to switch Master/Slave Datastore
> > applications from serving to backup data center requires an hour of
> > read-only time to complete. While it would be possible to pre-emptively
> > perform a normal maintenance procedure to switch from serving to backup
> > data center for Master/Slave Datastore applications when adverse weather
> > conditions are expected, the majority of the time, adverse weather does
> > not result in a service outage. Implementing this policy would result in
> > a far greater amount of Master/Slave Datastore read-only periods without
> > a guarantee of reduced unplanned outages. This policy would provide no
> > protection against outages that occur without sufficient forewarning,
> > e.g. fire or loss of network connectivity.
> >
> > Recommendations
> >
> > The High Replication Datastore for App Engine applications is
> > specifically engineered to be resilient in the face of sudden outages
> > affecting one or more data centers. Data written to the High Replication
> > Datastore is synchronously replicated to multiple datacenters before App
> > Engine indicates success to your application.
> >
> > Had High Replication Datastore applications been serving out of this data
> > center, it’s entirely possible they would have experienced minimal or no
> > degradation or outage. Additionally, the App Engine team could have
> > ceased all serving from that data center within minutes, without any
> > temporary stranding of data, or other adverse events.
> >
> > The Google App Engine team encourages all App Engine customers to migrate
> > their applications from the Master/Slave Datastore to the High
> > Replication Datastore. The High Replication Datastore is now the default
> > for new App Engine applications, the SLA available under the upcoming new
> > pricing model applies exclusively to High Replication Datastore
> > applications, and we are testing improved migration tools with early
> > adopters now. (You can sign up to be an early adopter at this link:
> > http://goo.gl/3jrXu ) In addition, new App Engine features, e.g. Go and
> > Python 2.7, will be available exclusively to High Replication Datastore
> > applications.
> >
> > Timeline (all times US/Pacific)
> >
> > 5:35pm: Google data center loses computing and storage capability, as a
> > result of loss of utility power due to severe thunderstorm in the area.
> > Google data center operations team begins responding to outage, in
> > contact with the App Engine team.
> > 6:50pm: Google data center operations reports data center will not return
> > to service promptly. App Engine team begins emergency maintenance to
> > switch Master/Slave Datastore applications to backup data center.
> > 7:10pm: appengine-downtime-notify forum is notified:
> > https://groups.google.com/forum/#!topic/google-appengine-downtime-notify/_yTJse1eOaI
> > 7:20pm: Master/Slave Datastore applications begin serving in read-only
> > mode during the emergency maintenance.
> > 7:50pm: App Engine team completes emergency maintenance, and Master/Slave
> > Datastore applications are serving normally again.
> >
> > -- Ikai Lan, on behalf of the App Engine team
> >
> > --
> > You received this message because you are subscribed to the Google Groups
> > "Google App Engine Downtime Notify" group.
> > To post to this group, send email to
> > [email protected].
> > To unsubscribe from this group, send email to
> > [email protected].
> > For more options, visit this group at
> > http://groups.google.com/group/google-appengine-downtime-notify?hl=en.
> >
>
> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>
>
