Thanks for posting this. The transparency helps greatly. I would like to add my vote to Chris's, though, regarding splitting Bigtable operations. It's a mistake.
The enormous appeal of App Engine today is that you've done an amazing job shielding us from needing to make these sorts of decisions. *THAT* is the hard problem that GAE is addressing. By making distinctions such as this one, you're fundamentally shifting your direction away from what ought to be (and has been?) a key design principle. In short - please don't make us choose. Just make it work.

Jan / Cloudbreak

On Mar 5, 12:22 pm, App Engine Team <[email protected]> wrote:
> Post-Mortem Summary
>
> This document details the cause of App Engine's outage on February 24th, 2010, the events immediately following it, and the steps we are taking to mitigate the impact of future outages like this one.
>
> On February 24th, 2010, all Google App Engine applications were in varying degraded states of operation for a period of two hours and twenty minutes, from 7:48 AM to 10:09 AM PT | 15:48 to 18:09 GMT. The underlying cause of the outage was a power failure in our primary datacenter. While the Google App Engine infrastructure is designed to recover quickly from these sorts of failures, this type of rare problem, combined with internal procedural issues, extended the time required to restore the service.
>
> <<Link to full timeline here, which is attached below.>>
>
> What did we do wrong?
>
> Though the team had planned for this sort of failure, our response had a few important issues:
>
> - Although we had procedures ready for this sort of outage, the oncall staff was unfamiliar with them and had not trained sufficiently on the specific recovery procedure for this type of failure.
>
> - Recent work to migrate the datastore for better multihoming changed and significantly improved the procedure for handling these failures. However, some documentation detailing the procedure to support the datastore during failover incorrectly referred to the old configuration. This led to confusion during the event.
> - The production team had not agreed on a policy that clearly indicates when, and in what situations, our oncall staff should take aggressive user-facing actions, such as an unscheduled failover. This led to a bad call: returning to a partially working datacenter.
>
> - We failed to plan for the case of a power outage that might affect some, but not all, of our machines in a datacenter (in this case, about 25%). In particular, this led to incorrect analysis of the serving state of the failed datacenter and of when it might recover.
>
> - Though we were eventually able to migrate traffic to the backup datacenter, a small number of Datastore entity groups, belonging to approximately 25 applications in total, became stuck in an inconsistent state as a result of the failover procedure. This represented considerably less than 0.00002% of the data stored in the Datastore.
>
> Ultimately, although significant work had been done over the past year to improve our handling of these types of outages, procedural issues reduced the benefit of that work.
>
> What are we doing to fix it?
>
> As a result, we have instituted the following procedures going forward:
>
> - Introduce regular drills by all oncall staff of all of our production procedures. This will include the rare and complicated procedures, and all members of the team will be required to complete the drills before joining the oncall rotation.
>
> - Implement a regular bi-monthly audit of our operations docs to ensure that all needed procedures are properly findable and that all out-of-date docs are properly marked "Deprecated."
>
> - Establish a clear policy framework to assist oncall staff in quickly and decisively making decisions about taking intrusive, user-facing actions during failures. This will allow them to act confidently and without delay in emergency situations.
> We believe that with these new procedures in place, last week's outage would have been reduced in impact from about 2 hours of total unavailability to about 10 to 20 minutes of partial unavailability.
>
> In response to this outage, we have also decided to make a major infrastructural change in App Engine. Currently, App Engine provides a one-size-fits-all Datastore that offers low write latency combined with strong consistency, in exchange for lower availability during an unexpected failure in one of our serving datacenters. In response to this outage, and to feedback from our users, we have begun work on providing two different Datastore configurations:
>
> - The current option of low latency, strong consistency, and lower availability during unexpected failures (like a power outage)
>
> - A new option for higher availability using synchronous replication for reads and writes, at the cost of significantly higher latency
>
> We believe that providing both of these options to you, our users, will allow you to make your own informed decisions about the tradeoffs you want to make in running your applications.
>
> We sincerely apologize for the impact of the Feb 24th service disruption on your applications. We take great pride in the reliability that App Engine offers, but we also recognize that we can do more to improve it. You can be confident that we will continue to work diligently to improve the service and ensure that low-level outages like this have the least possible effect on our customers.
>
> Timeline
> -----------
>
> 7:48 AM - Internal monitoring graphs first begin to show that traffic in our primary datacenter has problems and is returning an elevated number of errors. Around the same time, posts begin to show up in the google-appengine discussion group from users who are having trouble accessing App Engine.
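An aside on the tradeoff the two proposed Datastore configurations embody: a write that commits as soon as the local datacenter acks is fast but lost to clients if that datacenter fails, while a synchronous write must wait for the slowest participating replica before it commits. This toy model is purely illustrative (it is not App Engine code, and the latency figures are invented assumptions):

```python
# Toy model of write-commit latency under the two proposed Datastore
# configurations. All names and numbers here are illustrative assumptions,
# not App Engine internals or measured values.

def primary_only_write_latency(primary_ms):
    """Current option: the write commits once the primary datacenter acks."""
    return primary_ms

def synchronous_write_latency(primary_ms, replica_ms_list):
    """Proposed high-availability option: the write commits only after the
    slowest participating datacenter has acknowledged it."""
    return max([primary_ms] + replica_ms_list)

if __name__ == "__main__":
    primary = 20         # assumed in-datacenter commit time, in ms
    replicas = [80, 95]  # assumed cross-datacenter round trips, in ms

    # Low latency, but writes are unavailable if the primary goes down.
    print(primary_only_write_latency(primary))            # -> 20
    # Higher latency, but the write survives a whole-datacenter outage.
    print(synchronous_write_latency(primary, replicas))   # -> 95
```

The point of the sketch is simply that synchronous replication pins every write to the slowest datacenter in the quorum, which is why the post-mortem describes the new option as trading "significantly higher latency" for availability.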
> 7:53 AM - Google Site Reliability Engineers send an email to a broad audience notifying oncall staff that there has been a power outage in our primary datacenter. Google's datacenters have backup power generators for these situations, but in this case around 25% of the machines in the datacenter did not receive backup power in time and crashed. At this time, our oncall staff was paged.
>
> 8:01 AM - By this time, our primary oncall engineer has determined the extent and impact of the problem behind the page, and has determined that App Engine is down. The oncall engineer, according to procedure, pages our product managers and engineering leads to handle communicating about the outage to our users. A few minutes later, the first post from the App Engine team about this outage is made on the external group ("We are investigating this issue.").
>
> 8:22 AM - After further analysis, we determine that although power has returned to the datacenter, many machines in the datacenter are missing due to the power outage and are not able to serve traffic. In particular, it is determined that the GFS and Bigtable clusters are not in a functioning state, due to having lost too many machines, and that the Datastore is therefore not usable in the primary datacenter at that time. The oncall engineer discusses performing a failover to our alternate datacenter with the rest of the oncall team. Agreement is reached to pursue our failover procedure for unplanned datacenter outages.
>
> 8:36 AM - Following up on the post in the discussion group outage thread, the App Engine team makes a post about the outage to our appengine-downtime-notify group and to the App Engine Status site.
>
> 8:40 AM - The primary oncall engineer discovers two conflicting sets of procedures. This was a result of the operations process changing after our recent migration of the Datastore.
> After discussion with other oncall engineers, consensus is not reached, and members of the engineering team attempt to contact the specific engineers responsible for the procedure change to resolve the situation.
>
> 8:44 AM - While others attempt to determine which unexpected-failover procedure is correct, the oncall engineer attempts to move all traffic into a read-only state in our alternate datacenter. Traffic is moved, but an unexpected configuration problem with this procedure prevents the read-only traffic from working properly.
>
> 9:08 AM - Various engineers are diagnosing the problem with read-only traffic in our alternate datacenter. In the meantime, however, the primary oncall engineer sees data that leads them to believe that our primary datacenter has recovered and may be able to serve. Without a clear rubric with which to make this decision, the engineer was not aware that, based on historical data, the primary datacenter was unlikely to have recovered to a usable state by that point. Traffic is moved back to the original primary datacenter in an attempt to resume serving, while others debug the read-only issue in the alternate datacenter.
>
> 9:18 AM - The primary oncall engineer determines that the primary datacenter has not recovered and cannot serve traffic. It is now clear to oncall staff that the call was wrong, the primary will not recover, and we must focus on the alternate datacenter. Traffic is failed back over to the alternate datacenter, and the oncall engineer makes the decision to follow the unplanned failover procedure and begins the process.
>
> 9:35 AM - An engineer familiar with the unplanned failover procedure is reached and begins providing guidance on it. Traffic is moved to our alternate datacenter, initially in read-only mode.
>
> 9:48 AM - Serving for App Engine begins externally in read-only mode, from our alternate datacenter.
> At this point, apps that properly handle read-only periods should be serving correctly, though in a reduced operational state.
>
> 9:53 AM - After the engineering team consults with the relevant engineers, now online, the correct unplanned failover procedure document is confirmed and is ready to be used by the oncall engineer. The actual unplanned failover procedure for reads and writes begins.
>
> 10:09 AM - The unplanned failover procedure completes without any problems. Traffic resumes serving normally, read and write. App Engine is considered up at this time.
>
> 10:19 AM - A follow-up post is made to the appengine-downtime-notify group, letting people know that App Engine is now serving normally.
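On the timeline's note about "apps that properly handle read-only periods": the pattern is to attempt the write, catch the error the datastore raises while writes are disabled, and serve a degraded response instead of a 500. This is a framework-agnostic sketch, not App Engine SDK code; the `ReadOnlyError` class and `Store` object are hypothetical stand-ins for whatever error and storage layer an app actually uses:

```python
# Generic sketch of the "handle read-only periods" pattern. ReadOnlyError and
# Store are invented stand-ins, not App Engine APIs; a real app would catch
# whatever exception its datastore raises during a read-only maintenance window.

class ReadOnlyError(Exception):
    """Stand-in for the error a datastore raises while writes are disabled."""

class Store:
    def __init__(self, read_only=False):
        self.read_only = read_only
        self.data = {}

    def put(self, key, value):
        if self.read_only:
            raise ReadOnlyError("datastore is in read-only mode")
        self.data[key] = value

def handle_request(store, key, value):
    """Save if possible; otherwise degrade gracefully instead of erroring out."""
    try:
        store.put(key, value)
        return "saved"
    except ReadOnlyError:
        # Reads still work during the outage; only the write is refused, so
        # tell the user their change was not stored rather than failing hard.
        return "read-only: change not saved, please retry later"
```

An app written this way keeps serving pages from read-only data during a failover window like the 9:48-10:09 one above, which is what the post-mortem means by "a reduced operational state."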
