Postmortem for March 19th, 2010

Nathan Ingersoll Thu, 01 Apr 2010 19:09:52 -0700

Postmortem Summary


This document details the events leading up to and during the Project
Hosting outage on March 19th, 2010. It also includes the causes for
the outage and what we plan to do to prevent this problem in the
future.

On March 19th, 2010, Project Hosting on Google Code experienced
varying degrees of degraded service across all components for
approximately 4 hours, beginning around 17:00 GMT.

Approximately 2 hours before the services on Project Hosting were
noticeably impacted, RPC latency accessing the primary datastore began
to increase. Over the following 3 hours, latency steadily increased
until the connection pools in the HTTP proxies for the SVN service
filled up. Long RPC deadlines in the SVN
server implementation compounded the latency and led to additional
connection backlog. Due to a configuration error, all Project Hosting
connection pools were considered to be full, and requests to other
Project Hosting services began to fail.
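
The way rising latency and long deadlines fill a connection pool can
be sketched with Little's law: the average number of connections in
use equals the request arrival rate multiplied by the time each
request holds its connection. The Python below is an illustrative
sketch only. The function names and every number in it are
hypothetical and are not measurements from the incident.

```python
def connections_in_use(requests_per_second, seconds_held):
    """Little's law: average concurrency = arrival rate * time held."""
    return requests_per_second * seconds_held


def pool_is_saturated(requests_per_second, backend_latency_s,
                      rpc_deadline_s, pool_size):
    """Return True if steady-state demand meets or exceeds the pool size.

    A request holds its connection until the backend answers or the RPC
    deadline expires, whichever comes first.
    """
    seconds_held = min(backend_latency_s, rpc_deadline_s)
    return connections_in_use(requests_per_second, seconds_held) >= pool_size


# Hypothetical load: 50 requests/s, 200-connection pool, 10 s backend latency.
print(pool_is_saturated(50, 10.0, 60.0, 200))  # long RPC deadline
print(pool_is_saturated(50, 10.0, 2.0, 200))   # strict RPC deadline
```

With these made-up numbers, a 60-second deadline lets demand reach
about 500 connections against a pool of 200, while a 2-second deadline
caps demand at about 100. Requests that hit the deadline still fail,
but they release their connections instead of adding to the backlog.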

<< Complete timeline provided below. >>


What did we do wrong?


While the team is prepared to handle these types of degraded
conditions, there are some key shortcomings in our infrastructure
design and procedures that exacerbated the problem.

- A majority of the Project Hosting services serve user requests from
a single datacenter at a time. While we have live replication to other
datacenters, there is a manual failover procedure necessary to move
traffic to one of these backups.

- The available backup datacenters were scheduled for maintenance very
soon, and we hesitated to move to one of them because we wanted to
avoid further read-only time at a later date. The team makes every
effort to avoid these read-only windows when possible, since they
prevent users from effectively using most features on Project Hosting.

- A mistake in the HTTP proxy configuration led to additional services
besides SVN being impacted. The proxies treated the full connection
pool to the SVN service as a connection problem affecting all
components of the service and began to reject connections.
These other components had otherwise handled the increased RPC latency
in a reasonable manner.
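
The kind of per-service isolation that was missing in the proxy
configuration can be sketched as one bounded connection pool per
backend service, with admission decided only by the pool of the
service being requested. This is a minimal Python sketch under stated
assumptions: the class names, the service names "svn" and "issues",
and the pool sizes are hypothetical and do not describe the actual
proxy software or its configuration.

```python
import threading


class ServicePool:
    """Bounded set of connection slots for one backend service."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False immediately when the pool is full.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


class Proxy:
    """Hypothetical proxy that keeps a separate pool per service."""

    def __init__(self, limits):
        self._pools = {name: ServicePool(n) for name, n in limits.items()}

    def admit(self, service):
        # Only the requested service's pool is consulted, so a full
        # pool for one service cannot cause rejections for another.
        return self._pools[service].try_acquire()

    def finish(self, service):
        self._pools[service].release()


proxy = Proxy({"svn": 2, "issues": 2})
svn_results = [proxy.admit("svn") for _ in range(3)]
print(svn_results)            # third SVN request is rejected
print(proxy.admit("issues"))  # other services are still admitted
```

In this sketch a saturated "svn" pool rejects further SVN requests
while requests to "issues" continue to be admitted, which is the
behavior a shared or misread pool state would not give.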


What are we doing to fix it?


We have been making invasive changes to our infrastructure over the
past year to minimize the impact of these types of problems.
Specifically, we are rewriting our storage layers to allow serving
traffic from multiple datacenters simultaneously and to reduce our
sensitivity to fluctuations in the underlying datastore latency.

- Our new SVN server implementation
(http://googlecode.blogspot.com/2010/03/faster-subversion-hosting.html)
is far more efficient and degrades more gracefully in the face of
increased datastore latency. Additionally, it uses Paxos
(http://en.wikipedia.org/wiki/Paxos_algorithm) to guarantee consistent
data across all datacenters. The new implementation was launched prior
to this incident and allowed us to operate far beyond the threshold of
the previous implementation. We have added stricter deadlines to our
RPCs to further reduce the impact of datastore latency.

- We will move our remaining data storage to Paxos-based storage
methods, which will give us stronger data consistency across
datacenters, and eliminate the need for read-only time when moving
traffic to different datacenters.

- We will serve user-facing traffic from multiple datacenters
simultaneously. This will be possible because of the data consistency
guarantees provided by Paxos, and it will allow us to constrain the
amount of traffic impacted by an event such as the one on March 19th.

- We have already fixed our HTTP proxy configuration to properly
isolate services. Future degradation in a single service should not
impact other services.

- We are clarifying our oncall documentation to help staff determine
when to move user traffic to a backup datacenter, and to provide
guidance on escalating investigations into datastore instability.

We sincerely apologize for any impact this outage may have had on your
development efforts. We will continue to make the improved stability
of the service a top priority for our team, and are dedicated to
providing a reliable, quality development experience to our users.


Timeline (GMT)

15:00 - Internal monitoring detects slowly increasing latency in RPCs
to the datastore. The latency increases are well below paging
thresholds.

17:00 - Internal monitoring begins to detect latency increases in the
Project Hosting service; again, the increases are still below paging
thresholds.

18:00 - Latency for the SVN service exceeds paging thresholds and the
Project Hosting team is notified of the problem.

18:15 - Initial investigations reveal that datastore latency has
increased significantly over the last 3 hours. Work begins on
diagnosing the cause of the increased latency.

18:30 - Preparations begin for shifting traffic to a hot spare
datacenter. The team is reluctant to do so with the knowledge that we
will be required to move back to the current datacenter soon because
of scheduled maintenance and incur additional read-only time for end
users.

19:00 - Internal monitoring notifies the team that additional service
components are being impacted by the increased latency. Investigation
reveals that the HTTP proxies are actually the limiting factor for
these components and attempts are made to fix the proxy configuration.

19:15 - Datastore instability investigation is escalated to more
experienced staff and investigation continues.

20:00 - Decision is made to move traffic to one of the hot backup
datacenters. The team begins to place the site in read-only mode to
ensure data consistency in backup locations.

20:30 - Stability is returned to the datastore. The failover process
is cancelled and the team begins to roll back changes to return to
normal operating conditions.

21:00 - Normal operating conditions are fully restored.

