Thanks for posting this. The transparency helps greatly. I would like to add my vote to Chris's, though, regarding splitting Bigtable operations. It's a mistake.
The enormous appeal of App Engine today is that you've done an amazing job shielding us from needing to make these sorts of decisions. *THAT* is the hard problem that GAE is addressing. By making distinctions such as this one, you're fundamentally shifting your direction away from what ought to be (and has been?) a key design principle. In short - please don't make us choose. Just make it work.

Jan / Cloudbreak

On Mar 5, 12:22 pm, App Engine Team <[email protected]> wrote:
> Post-Mortem Summary
>
> This document details the cause of App Engine's outage on February 24th, 2010, the events immediately following it, and the steps we are taking to mitigate the impact of future outages like this one.
>
> On February 24th, 2010, all Google App Engine applications were in varying degraded states of operation for a period of two hours and twenty minutes, from 7:48 AM to 10:09 AM PT | 15:48 to 18:09 GMT. The underlying cause of the outage was a power failure in our primary datacenter. While the Google App Engine infrastructure is designed to recover quickly from these sorts of failures, this type of rare problem, combined with internal procedural issues, extended the time required to restore the service.
>
> <<Link to full timeline here, which is attached below.>>
>
> What did we do wrong?
>
> Though the team had planned for this sort of failure, our response had a few important issues:
>
> - Although we had procedures ready for this sort of outage, the oncall staff was unfamiliar with them and had not trained sufficiently on the specific recovery procedure for this type of failure.
>
> - Recent work to migrate the datastore for better multihoming changed and significantly improved the procedure for handling these failures. However, some documentation detailing the procedure to support the datastore during failover incorrectly referred to the old configuration. This led to confusion during the event.
> - The production team had not agreed on a policy that clearly indicates when, and in what situations, our oncall staff should take aggressive user-facing actions, such as an unscheduled failover. This led to a bad call: returning to a partially working datacenter.
>
> - We failed to plan for the case of a power outage that might affect some, but not all, of our machines in a datacenter (in this case, about 25%). In particular, this led to incorrect analysis of the serving state of the failed datacenter and of when it might recover.
>
> - Though we were eventually able to migrate traffic to the backup datacenter, a small number of Datastore entity groups, belonging to approximately 25 applications in total, became stuck in an inconsistent state as a result of the failover procedure. This represented considerably less than 0.00002% of the data stored in the Datastore.
>
> Ultimately, although significant work had been done over the past year to improve our handling of these types of outages, procedural issues reduced the benefit of that work.
>
> What are we doing to fix it?
>
> As a result, we have instituted the following procedures going forward:
>
> - Introduce regular drills by all oncall staff of all of our production procedures. This will include the rare and complicated procedures, and all members of the team will be required to complete the drills before joining the oncall rotation.
>
> - Implement a regular bi-monthly audit of our operations docs to ensure that all needed procedures are properly findable and that all out-of-date docs are properly marked "Deprecated."
>
> - Establish a clear policy framework to assist oncall staff in quickly and decisively making decisions about taking intrusive, user-facing actions during failures. This will allow them to act confidently and without delay in emergency situations.
> We believe that with these new procedures in place, last week's outage would have been reduced in impact from about 2 hours of total unavailability to about 10 to 20 minutes of partial unavailability.
>
> In response to this outage, we have also decided to make a major infrastructural change in App Engine. Currently, App Engine provides a one-size-fits-all Datastore that offers low write latency combined with strong consistency, in exchange for lower availability during an unexpected failure in one of our serving datacenters. In response to this outage, and to feedback from our users, we have begun work on providing two different Datastore configurations:
>
> - The current option of low latency, strong consistency, and lower availability during unexpected failures (like a power outage)
>
> - A new option for higher availability using synchronous replication for reads and writes, at the cost of significantly higher latency
>
> We believe that providing both of these options to you, our users, will allow you to make your own informed decisions about the tradeoffs you want to make in running your applications.
>
> We sincerely apologize for the impact of the Feb 24th service disruption on your applications. We take great pride in the reliability that App Engine offers, but we also recognize that we can do more to improve it. You can be confident that we will continue to work diligently to improve the service and ensure that low-level outages like this have the least possible effect on our customers.
>
> Timeline
> -----------
>
> 7:48 AM - Internal monitoring graphs first begin to show that traffic in our primary datacenter has problems and is returning an elevated number of errors. Around the same time, posts begin to show up in the google-appengine discussion group from users who are having trouble accessing App Engine.
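An aside on the tradeoff the two proposed Datastore configurations embody: a write that commits as soon as the local datacenter acks is fast but lost to clients if that datacenter fails, while a synchronous write must wait for the slowest participating replica before it commits. This toy model is purely illustrative (it is not App Engine code, and the latency figures are invented assumptions):

```python
# Toy model of write-commit latency under the two proposed Datastore
# configurations. All names and numbers here are illustrative assumptions,
# not App Engine internals or measured values.

def primary_only_write_latency(primary_ms):
    """Current option: the write commits once the primary datacenter acks."""
    return primary_ms

def synchronous_write_latency(primary_ms, replica_ms_list):
    """Proposed high-availability option: the write commits only after the
    slowest participating datacenter has acknowledged it."""
    return max([primary_ms] + replica_ms_list)

if __name__ == "__main__":
    primary = 20         # assumed in-datacenter commit time, in ms
    replicas = [80, 95]  # assumed cross-datacenter round trips, in ms

    # Low latency, but writes are unavailable if the primary goes down.
    print(primary_only_write_latency(primary))            # -> 20
    # Higher latency, but the write survives a whole-datacenter outage.
    print(synchronous_write_latency(primary, replicas))   # -> 95
```

The point of the sketch is simply that synchronous replication pins every write to the slowest datacenter in the quorum, which is why the post-mortem describes the new option as trading "significantly higher latency" for availability.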
> 7:53 AM - Google Site Reliability Engineers send an email to a broad audience notifying oncall staff that there has been a power outage in our primary datacenter. Google's datacenters have backup power generators for these situations, but in this case around 25% of the machines in the datacenter did not receive backup power in time and crashed. At this time, our oncall staff was paged.
>
> 8:01 AM - By this time, our primary oncall engineer has determined the extent and impact of the problem behind the page, and has determined that App Engine is down. The oncall engineer, according to procedure, pages our product managers and engineering leads to handle communicating about the outage to our users. A few minutes later, the first post from the App Engine team about this outage is made on the external group ("We are investigating this issue.").
>
> 8:22 AM - After further analysis, we determine that although power has returned to the datacenter, many machines in the datacenter are missing due to the power outage and are not able to serve traffic. In particular, it is determined that the GFS and Bigtable clusters are not in a functioning state, due to having lost too many machines, and that the Datastore is therefore not usable in the primary datacenter at that time. The oncall engineer discusses performing a failover to our alternate datacenter with the rest of the oncall team. Agreement is reached to pursue our failover procedure for unplanned datacenter outages.
>
> 8:36 AM - Following up on the post in the discussion group outage thread, the App Engine team makes a post about the outage to our appengine-downtime-notify group and to the App Engine Status site.
>
> 8:40 AM - The primary oncall engineer discovers two conflicting sets of procedures. This was a result of the operations process changing after our recent migration of the Datastore.
> After discussion with other oncall engineers, consensus is not reached, and members of the engineering team attempt to contact the specific engineers responsible for the procedure change to resolve the situation.
>
> 8:44 AM - While others attempt to determine which unexpected-failover procedure is correct, the oncall engineer attempts to move all traffic into a read-only state in our alternate datacenter. Traffic is moved, but an unexpected configuration problem with this procedure prevents the read-only traffic from working properly.
>
> 9:08 AM - Various engineers are diagnosing the problem with read-only traffic in our alternate datacenter. In the meantime, however, the primary oncall engineer sees data that leads them to believe that our primary datacenter has recovered and may be able to serve. Without a clear rubric with which to make this decision, the engineer was not aware that, based on historical data, the primary datacenter was unlikely to have recovered to a usable state by that point. Traffic is moved back to the original primary datacenter in an attempt to resume serving, while others debug the read-only issue in the alternate datacenter.
>
> 9:18 AM - The primary oncall engineer determines that the primary datacenter has not recovered and cannot serve traffic. It is now clear to oncall staff that the call was wrong, the primary will not recover, and we must focus on the alternate datacenter. Traffic is failed back over to the alternate datacenter, and the oncall engineer makes the decision to follow the unplanned failover procedure and begins the process.
>
> 9:35 AM - An engineer familiar with the unplanned failover procedure is reached and begins providing guidance on it. Traffic is moved to our alternate datacenter, initially in read-only mode.
>
> 9:48 AM - Serving for App Engine begins externally in read-only mode, from our alternate datacenter.
> At this point, apps that properly handle read-only periods should be serving correctly, though in a reduced operational state.
>
> 9:53 AM - After the engineering team consults with the relevant engineers, now online, the correct unplanned failover procedure document is confirmed and is ready to be used by the oncall engineer. The actual unplanned failover procedure for reads and writes begins.
>
> 10:09 AM - The unplanned failover procedure completes without any problems. Traffic resumes serving normally, read and write. App Engine is considered up at this time.
>
> 10:19 AM - A follow-up post is made to the appengine-downtime-notify group, letting people know that App Engine is now serving normally.
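On the timeline's note about "apps that properly handle read-only periods": the pattern is to attempt the write, catch the error the datastore raises while writes are disabled, and serve a degraded response instead of a 500. This is a framework-agnostic sketch, not App Engine SDK code; the `ReadOnlyError` class and `Store` object are hypothetical stand-ins for whatever error and storage layer an app actually uses:

```python
# Generic sketch of the "handle read-only periods" pattern. ReadOnlyError and
# Store are invented stand-ins, not App Engine APIs; a real app would catch
# whatever exception its datastore raises during a read-only maintenance window.

class ReadOnlyError(Exception):
    """Stand-in for the error a datastore raises while writes are disabled."""

class Store:
    def __init__(self, read_only=False):
        self.read_only = read_only
        self.data = {}

    def put(self, key, value):
        if self.read_only:
            raise ReadOnlyError("datastore is in read-only mode")
        self.data[key] = value

def handle_request(store, key, value):
    """Save if possible; otherwise degrade gracefully instead of erroring out."""
    try:
        store.put(key, value)
        return "saved"
    except ReadOnlyError:
        # Reads still work during the outage; only the write is refused, so
        # tell the user their change was not stored rather than failing hard.
        return "read-only: change not saved, please retry later"
```

An app written this way keeps serving pages from read-only data during a failover window like the 9:48-10:09 one above, which is what the post-mortem means by "a reduced operational state."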
