Postmortem Summary
This document details the events leading up to and during the Project Hosting outage on March 19th, 2010. It also covers the causes of the outage and what we plan to do to prevent this problem in the future.

On March 19th, 2010, Project Hosting on Google Code experienced varying degrees of degraded service across all components for approximately 4 hours, beginning at approximately 17:00 GMT.

Approximately 2 hours before the services on Project Hosting were noticeably impacted, RPC latency accessing the primary datastore began to increase. Over the following 3 hours, latency steadily increased and reached the point where the connection pools in the HTTP proxies for the SVN service filled up. Long RPC deadlines in the SVN server implementation compounded the latency and led to additional connection backlog. Due to a configuration error, all Project Hosting connection pools were considered to be full, and requests to other Project Hosting services began to fail.

<< Complete timeline provided below. >>

What did we do wrong?

While the team is prepared to handle these types of degraded conditions, there are some key shortcomings in our infrastructure design and procedures that exacerbated the problem.

- A majority of the Project Hosting services serve user requests from a single datacenter at a time. While we have live replication to other datacenters, a manual failover procedure is necessary to move traffic to one of these backups.
- The available backup datacenters were scheduled for maintenance very soon, and we hesitated to move to one of them because we wanted to avoid further read-only time at a later date. The team makes every effort to avoid these read-only windows when possible, since they prevent users from effectively using most features on Project Hosting.
- A mistake in the HTTP proxy configuration led to additional services besides SVN being impacted.
  The filling connection pool for the SVN service was perceived as a connection problem affecting all components of the service, and the proxies began to reject connections. These other components had otherwise handled the increased RPC latency in a reasonable manner.

What are we doing to fix it?

We have been making invasive changes to our infrastructure over the past year to minimize the impact of these types of problems. Specifically, we are rewriting our storage layers to allow serving traffic from multiple datacenters simultaneously and to reduce our sensitivity to fluctuations in the underlying datastore latency.

- Our new SVN server implementation (http://googlecode.blogspot.com/2010/03/faster-subversion-hosting.html) is far more efficient and degrades more gracefully in the face of increased datastore latency. Additionally, it uses Paxos (http://en.wikipedia.org/wiki/Paxos_algorithm) to guarantee consistent data across all datacenters. The new implementation was launched prior to this incident and allowed us to operate far beyond the threshold of the previous implementation. We have added stricter deadlines to our RPCs to further reduce the impact of datastore latency.
- We will move our remaining data storage to Paxos-based storage methods, which will give us stronger data consistency across datacenters and eliminate the need for read-only time when moving traffic to different datacenters.
- We will serve user-facing traffic from multiple datacenters simultaneously. This will be possible because of the data consistency guarantees provided by Paxos, and it will allow us to limit the amount of traffic impacted by an event such as the one on March 19th.
- We have already fixed our HTTP proxy configuration to properly isolate services. Future degradation in a single service should not impact other services.
- We are clarifying our on-call documentation to help staff determine when to move user traffic to a backup datacenter, and to provide guidance on escalating investigations into datastore instability.

We sincerely apologize for any impact this outage may have had on your development efforts. We will continue to make the improved stability of the service a top priority for our team, and are dedicated to providing a reliable, quality development experience to our users.

Timeline (GMT)

15:00 - Internal monitoring detects slowly increasing latency in RPCs to the datastore. The latency increases are well below paging thresholds.
17:00 - Internal monitoring begins to detect latency increases in the Project Hosting service; again, the increases are still below paging thresholds.
18:00 - Latency for the SVN service exceeds paging thresholds and the Project Hosting team is notified of the problem.
18:15 - Initial investigations reveal that datastore latency has increased significantly over the last 3 hours. Work begins on diagnosing the cause of the increased latency.
18:30 - Preparations begin for shifting traffic to a hot spare datacenter. The team is reluctant to do so, knowing that scheduled maintenance will soon require a move back to the current datacenter and incur additional read-only time for end users.
19:00 - Internal monitoring notifies the team that additional service components are being impacted by the increased latency. Investigation reveals that the HTTP proxies are actually the limiting factor for these components, and attempts are made to fix the proxy configuration.
19:15 - Datastore instability investigation is escalated to more experienced staff and investigation continues.
20:00 - Decision is made to move traffic to one of the hot backup datacenters. The team begins to place the site in read-only mode to ensure data consistency in backup locations.
20:30 - Stability is returned to the datastore.
The failover process is cancelled and the team begins to roll back changes to return to normal operating conditions.
21:00 - Normal operating conditions are fully restored.
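
As an illustration of the proxy isolation fix described above, here is a minimal sketch of per-service connection pools, written in Python. It is not the actual Project Hosting proxy configuration: the class names (ServicePool, Proxy), the service names ("svn", "issues") and the pool limits are all hypothetical. The idea it shows is only that each service gets its own bounded pool, so a full SVN pool rejects SVN requests without being treated as a problem for every other service.

```python
import threading


class ServicePool:
    """A bounded set of connection slots for one backend service (hypothetical)."""

    def __init__(self, name, max_connections):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: a full pool rejects immediately instead of queueing.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


class Proxy:
    """Keeps one pool per service so a full pool only affects its own service."""

    def __init__(self, limits):
        # limits: assumed mapping of service name -> maximum open connections
        self._pools = {name: ServicePool(name, n) for name, n in limits.items()}

    def admit(self, service):
        """Return True if the request obtained a slot in that service's pool."""
        return self._pools[service].try_acquire()

    def finish(self, service):
        """Return the slot once the request has completed."""
        self._pools[service].release()


proxy = Proxy({"svn": 2, "issues": 2})

# Simulate slow SVN requests that hold on to their slots.
svn_results = [proxy.admit("svn") for _ in range(3)]

# A request to another service is still admitted although the SVN pool is full.
issues_result = proxy.admit("issues")

print(svn_results, issues_result)  # prints: [True, True, False] True
```

In the misconfigured setup described in this postmortem, the equivalent of the last admit() call would have failed as well, because the full SVN pool was interpreted as all pools being full.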

