Post-Mortem Summary
This document details the cause of App Engine's outage on February
24th, 2010, the events that occurred immediately afterward, and the
steps we are taking to mitigate the impact of similar outages in the
future.
On February 24th, 2010, all Google App Engine applications were in
varying degraded states of operation for a period of two hours and
twenty minutes, from 7:48 AM to 10:09 AM PT (15:48 to 18:09 GMT). The
underlying cause of the outage was a power failure in our primary
datacenter. While the Google App Engine infrastructure is designed to
recover quickly from failures of this sort, this rare type of
problem, combined with internal procedural issues, extended the time
required to restore the service.
A full timeline of the outage is included at the end of this post.
What did we do wrong?
Though the team had planned for this sort of failure, our response had
a few important issues:
- Although we had procedures ready for this sort of outage, the oncall
staff was unfamiliar with them and had not trained sufficiently with
the specific recovery procedure for this type of failure.
- Recent work to migrate the datastore for better multihoming changed
and improved the procedure for handling these failures significantly.
However, some documentation detailing the procedure to support the
datastore during failover incorrectly referred to the old
configuration. This led to confusion during the event.
- The production team had not agreed on a policy clearly indicating
when, and in what situations, our oncall staff should take
aggressive user-facing actions, such as an unscheduled failover. This
led to a poor decision to return traffic to a partially working
datacenter.
- We failed to plan for the case of a power outage that might affect
some, but not all, of our machines in a datacenter (in this case,
about 25%). In particular, this led to incorrect analysis of the
serving state of the failed datacenter and when it might recover.
- Though we were able to eventually migrate traffic to the backup
datacenter, a small number of Datastore entity groups, belonging to
approximately 25 applications in total, became stuck in an
inconsistent state as a result of the failover procedure. This
represented considerably less than 0.00002% of data stored in the
Datastore.
Ultimately, although significant work had been done over the past year
to improve our handling of these types of outages, procedural issues
limited the benefit of that work during this event.
What are we doing to fix it?
As a result, we have instituted the following procedures going
forward:
- Introduce regular drills in which all oncall staff practice all of
our production procedures, including the rare and complicated ones.
All members of the team will be required to complete these drills
before joining the oncall rotation.
- Implement a regular bi-monthly audit of our operations docs to
ensure that all needed procedures are easy to find, and that all
out-of-date docs are clearly marked "Deprecated."
- Establish a clear policy framework to assist oncall staff to quickly
and decisively make decisions about taking intrusive, user-facing
actions during failures. This will allow them to act confidently and
without delay in emergency situations.
We believe that with these new procedures in place, last week's outage
would have been reduced in impact from about 2 hours of total
unavailability to about 10 to 20 minutes of partial unavailability.
In response to this outage, we have also decided to make a major
infrastructural change in App Engine. Currently, App Engine provides a
one-size-fits-all Datastore that offers low write latency combined
with strong consistency, in exchange for lower availability during an
unexpected failure in one of our serving datacenters. Based on this
outage, and on feedback from our users, we have begun work on
providing two different Datastore configurations:
- The current option of low-latency, strong consistency, and lower
availability during unexpected failures (like a power outage)
- A new option for higher availability using synchronous replication
for reads and writes, at the cost of significantly higher latency
We believe that providing both of these options to you, our users,
will allow you to make your own informed decisions about the tradeoffs
you want to make in running your applications.
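As a rough illustration of the tradeoff, consider a toy model of the
two configurations (the classes and behavior below are hypothetical,
not the Datastore implementation): the low-latency option acknowledges
a write after the local commit only, while the synchronous option
blocks the acknowledgment on replication to the alternate datacenter,
so the write survives a primary failure.

```python
class Datacenter:
    """Toy model of one datacenter's copy of the data."""
    def __init__(self):
        self.log = []
        self.up = True

def write(value, primary, alternate, synchronous):
    """Apply a write under one of the two hypothetical configurations."""
    primary.log.append(value)        # local commit: low latency
    if synchronous:
        # Cross-datacenter round trip before the ack: higher latency,
        # but the alternate datacenter can serve this write on its own.
        alternate.log.append(value)

def read(index, primary, alternate):
    """Read from whichever datacenter is currently serving."""
    dc = primary if primary.up else alternate
    return dc.log[index] if index < len(dc.log) else None
```

With `synchronous=False`, a write acknowledged just before a power
failure is unavailable until the primary recovers; with
`synchronous=True`, the alternate datacenter can serve it immediately,
at the price of an extra round trip on every write.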
We sincerely apologize for the impact of Feb 24th's service disruption
on your applications. We take great pride in the reliability that App
Engine offers, but we also recognize that we can do more to improve
it. You can be confident that we will continue to work diligently to
improve the service and to ensure that low-level outages like this one
have the least possible effect on our customers.
Timeline
-----------
7:48 AM - Internal monitoring graphs first begin to show that traffic
in our primary datacenter is experiencing problems and returning an
elevated number of errors. Around the same time, posts begin to show
up in the google-appengine discussion group from users who are having
trouble accessing App Engine.
7:53 AM - Google Site Reliability Engineers send an email to a broad
audience notifying oncall staff that there has been a power outage in
our primary datacenter. Google's datacenters have backup power
generators for these situations, but in this case around 25% of the
machines in the datacenter did not receive backup power in time and
crashed. At this time, our oncall staff was paged.
8:01 AM - By this time, our primary oncall engineer has assessed the
extent and impact of the outage and determined that App Engine is
down. The oncall engineer, according to procedure, pages our product
managers and engineering leads to handle communicating about the
outage to our users. A few minutes later, the first post from the App
Engine team about this outage is made on the external group ("We are
investigating this issue.").
8:22 AM - After further analysis, we determine that although power has
returned to the datacenter, many machines are missing due to the power
outage and are not able to serve traffic. In particular, it is
determined that the GFS and Bigtable clusters are not in a functioning
state, having lost too many machines, and thus that the Datastore is
not usable in the primary datacenter at that time. The oncall engineer
discusses performing a failover to our alternate datacenter with the
rest of the oncall team. Agreement is reached to pursue our failover
procedure for an unplanned datacenter outage.
8:36 AM - Following up on the post on the discussion group outage
thread, the App Engine team makes a post about the outage to our
appengine-downtime-notify group and to the App Engine Status site.
8:40 AM - The primary oncall engineer discovers two conflicting sets
of procedures, a result of the operations process changing after our
recent migration of the Datastore. After discussion with other oncall
engineers, consensus is not reached, and members of the engineering
team attempt to contact the engineers responsible for the procedure
change to resolve the situation.
8:44 AM - While others attempt to determine which is the correct
unexpected failover procedure, the oncall engineer attempts to move
all traffic into a read-only state in our alternate datacenter.
Traffic is moved, but an unexpected configuration problem from this
procedure prevents the read-only traffic from working properly.
9:08 AM - Various engineers are diagnosing the problem with read-only
traffic in our alternate datacenter. In the meantime, however, the
primary oncall engineer sees data suggesting that our primary
datacenter has recovered and may be able to serve. Without a clear
rubric for making this decision, the engineer is not aware that, based
on historical data, the primary datacenter is unlikely to have
recovered to a usable state by this point. Traffic is moved back to
the original primary datacenter in an attempt to resume serving, while
others debug the read-only issue in the alternate datacenter.
9:18 AM - The primary oncall engineer determines that the primary
datacenter has not recovered and cannot serve traffic. It is now clear
to oncall staff that the earlier call was wrong, the primary will not
recover, and we must focus on the alternate datacenter. Traffic is
failed back over to the alternate datacenter, and the oncall engineer
makes the decision to follow the unplanned failover procedure and
begins the process.
9:35 AM - An engineer familiar with the unplanned failover procedure
is reached and begins providing guidance on it. Traffic is moved to
our alternate datacenter, initially in read-only mode.
9:48 AM - Serving for App Engine begins externally in read-only mode,
from our alternate datacenter. At this point, apps that properly
handle read-only periods should be serving correctly, though in a
reduced operational state.
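An application can handle a read-only period by catching the
write-disabled error and degrading rather than failing outright. A
minimal sketch of that pattern follows; the class and error names are
hypothetical, not the actual App Engine API.

```python
class ReadOnlyError(Exception):
    """Hypothetical error raised when writes are disabled during failover."""

class Store:
    """Toy datastore that can be switched into read-only mode."""
    def __init__(self):
        self.data = {}
        self.read_only = False

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        if self.read_only:
            raise ReadOnlyError(key)
        self.data[key] = value

def handle_request(store, key, value):
    """Serve a request, degrading gracefully if writes are disabled."""
    try:
        store.put(key, value)
        return "saved"
    except ReadOnlyError:
        # Reads still work; tell the user that saving is paused.
        return "read-only: your change was not saved"
```

An app written this way keeps serving pages from existing data during
the read-only window instead of returning errors on every request.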
9:53 AM - After consultation with the relevant engineers, who are now
online, the correct unplanned failover operations document is
confirmed and is ready to be used by the oncall engineer. The actual
unplanned failover procedure for reads and writes begins.
10:09 AM - The unplanned failover procedure completes, without any
problems. Traffic resumes serving normally, read and write. App Engine
is considered up at this time.
10:19 AM - A follow-up post is made to the appengine-downtime-notify
group, letting people know that App Engine is now serving normally.