article on the big ec2 outage

Jim Gilliam Mon, 11 Jan 2010 17:56:23 -0800

after reading this, i'm amazed heroku got back up with no data loss in only
an hour!

http://searchcloudcomputing.techtarget.com/news/article/0,289142,sid201_gci1378426,00.html

Heroku learns the hard way from Amazon EC2 outage
By Carl Brooks, Technology Writer
08 Jan 2010 | SearchCloudComputing.com
Ruby on Rails Platform as a Service startup Heroku started off the new year
with a nasty surprise. Without warning on January 2, all of the specialized,
high-capacity Amazon EC2 instances that run its popular application and
development service disappeared in the blink of an eye. Twenty-two virtual
machines, approximately $20,000 per month in hosting fees for high-memory
m2.2xlarge instances, suddenly vanished, leaving Heroku's estimated 44,000
running applications in the lurch.

Amazon blamed a routing device in its Virginia data center, and the service
was back up in an hour. But Oren Teich, Heroku's product developer, said
this is one of the many important lessons new ventures and businesses need
to learn before they decide to work entirely in the cloud. Traditional
contingency planning doesn't go far enough, he said: expect the unexpected.

"[Normally] you need to assume that anything can fail -- where we didn't go
far enough was to assume that *everything* can fail," he said.

Teich said that while Heroku had designed for redundant servers and failover
capacity, this was a novel kind of blackout for a hosting provider. A server
failing was normal, he said, but it was unheard of for a whole class of
resources to suddenly vanish.

Heroku had recently moved its front-end servers onto the high-memory
m2.2xlarge instances, and some of those instances were already running "core
back-end stuff."

Teich also said that all of Heroku's m2.2xlarge instances were running in a
single availability zone, which was a mistake. He stressed that Heroku had
failover built in already -- if 21 instances had failed instead of 22, or if
it had spread instances across several zones, "we wouldn't be talking [about
the outage]," he said.

Nevertheless, on Friday, January 2, every m2.2xlarge instance in that
availability zone suddenly vanished, despite all other types of EC2
instances running as normal. That's unheard of in traditional hosting. It
would be like every server with a given amount of RAM suddenly shutting
down, regardless of operating system, age, brand, hardware or location in
the data center, with no effect on its neighbors.

"For us, there's the stuff you plan for and then there's the stuff you don't
even know about," Teich said.

An event like this was an "unknown unknown" that nobody planned for because
nobody imagined it. He chalked it up to the learning process and pointed out
that everybody in Amazon Web Services was flying by the seat of their pants at
least part of the time.

"It's not like there's 'best practices' for cloud computing yet," he said.

*EC2 expert understands cloud issues*
"I sympathize with Oren!" said EC2 expert and consultant Shlomo Swidler
<http://orchestratus.com/>.
"It's not easy to imagine completely new ways for things to fail, especially
things as complex as [EC2 and AWS]."

Swidler said more unique problems were bound to occur, but unanticipated
hiccups would shrink over time as users pooled their experiences -- typical
for technologies with an enthusiastic community. Until then, however, and
especially in high-availability systems like Heroku's PaaS service, the
risks of a new frontier remained.

"We'll all learn to consider those new failure modes in our designs. Until
then, the early adopters should be aware that they're accepting a certain
risk," Swidler said.

On the other hand, Amazon appears to have learned from past missteps. Teich
said he couldn't fault the support given by AWS, a different story than
others have told in the past. Even though the fault lay with Amazon's
operation, Heroku was contacted by AWS staff, who arranged for engineers
from both organizations to work on piecing together the incident and
preventative measures.

"They actually reached out to us," Teich said.

Amazon has been taken to task by other users in the past over the company's
lack of communication and transparency toward operational problems. Teich,
however, said he was more than happy with the process, which may mark the
turning of a new leaf for AWS in terms of user relations.

Teich added that despite incidents like these, which show that Amazon has
its share of quirks, the ease and flexibility of the service more than make
up for it.

A 15-person startup like Heroku could never support its thousands of users
for a measly few million in venture capital with traditional hosting, and
along with the cost benefits of using AWS, Heroku gets to blaze a trail for
next-generation platforms by discovering problems traditional hosting
doesn't have.

"The big lesson is that no matter how smart you are, it'll happen to you,"
he said.

*Carl Brooks is the Technology Writer at SearchCloudComputing.com. Contact
him at [email protected].*
