On Wed, Sep 12, 2012 at 3:12 PM, Kaan Soral <[email protected]> wrote:
> This is why I love App Engine, when a problem occurs instead of having a
> heart attack or committing suicide, you can just wait for it to be resolved.

Hmmm.  This very badly timed incident may have cost us an
important client, so I'm not feeling the love.

I have quite a lot of experience building and running large online
systems prior to embracing GAE and my products have never had as much
downtime as I've had over the last year.  It hasn't always been
Google's fault (the entire .st registry going down for 8+ hours really
sucked[1]) but it usually has been.  See:

 * Instance startup time ballooning by 3X and hitting deadlines
(multiple occasions)
 * GAE blocking CloudFlare with an undocumented security system
 * This incident, where Java instances started mysteriously failing

Would waiting have fixed these issues?  I'm not convinced.  Google may
have smart people running GAE but they aren't watching _my_ app,
they're just watching for an uptick in the number of complaints.  If
you're doing something slightly unusual (say, running a CF reverse
proxy), you might be statistical noise.  Apparently this Java problem
_was_ widespread, but I had no way of knowing that.

GAE's value proposition is that it's better to have Google's smart
engineers building and maintaining your infrastructure.  But my site
would be more reliable if I had one dumb person (possibly me) who
cares specifically about _my_ infrastructure.  I've screwed up
deployments and upgrades in production before, but at least I'm aware
when changes happen, get immediate feedback, and can fix the problem
right then and there.

With GAE, the only thing I can do when my alarms go off is to whine as
loudly as possible.  But there is no feedback!  I have no way of
knowing if Google is working on the problem or if they're still
waiting for more complaints that will never materialize.  Will I be
down for 15 minutes, 1 hour, 2 hours, 8 hours, forever?  How long do
you want to wait?

This feels like a fundamental flaw in the PaaS concept, destined to
produce multiple-hour downtimes at irregular intervals.  The feedback
loop is too slow (and lossy if the problem is not widespread).
No amount of QA or testing will prevent failures in a
system as big and complicated as GAE.  So the only reasonable option is
to make that feedback loop shorter.  How can that happen?  Some ideas:

 * Google could announce when they are rolling out changes.  I don't
need release notes (although it would be nice to know what to watch
for) but I'd like to know when I should pay extra attention.  Or not
schedule client demos.  Facebook does something like this, rolling out
platform changes on specific days of the week (which I long ago
stopped caring about).

 * Google could make extra support channels available during
rollouts.  Hell, use Twitter.  Think of us as your QA staff - if we see
something amiss, we'd like to let you know.

 * Google could be more transparent about problems as they happen.
When you know there is an issue, let us know.  Right now I must assume
that any problem which Google hasn't acknowledged is a problem Google
doesn't know about; once it is acknowledged, I can stop spamming
@google.com addresses.

 * Google could monitor our apps, and compare error rates before
rollout to error rates after rollout.  Ideally you'd break this down
by component; figure out which apps use the search api, so when you
roll out changes to the search system, you're specifically watching
for an uptick in 500 errors from those apps.  Something like that.
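
To make that last idea a bit more concrete, here is a rough sketch in
Python of the before/after comparison I have in mind.  Everything in it
is made up for illustration: the Request record, the apps_by_component
mapping, and the 5-percentage-point threshold are my assumptions, not
anything GAE actually exposes.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class Request:
    """Hypothetical log record; not a real GAE type."""
    app_id: str       # which app served the request
    timestamp: float  # seconds since epoch
    status: int       # HTTP status code returned to the client


def error_rate(requests: Iterable[Request]) -> float:
    """Fraction of requests that returned a 5xx status (0.0 if no traffic)."""
    total = errors = 0
    for r in requests:
        total += 1
        if 500 <= r.status <= 599:
            errors += 1
    return errors / total if total else 0.0


def flag_regressions(
    requests: Iterable[Request],
    apps_by_component: Dict[str, Set[str]],
    component: str,
    rollout_time: float,
    min_increase: float = 0.05,  # assumed threshold, tune to taste
) -> Dict[str, float]:
    """Return {app_id: increase in 5xx rate} for apps that use `component`
    and whose error rate rose by at least `min_increase` after the rollout."""
    watched = apps_by_component.get(component, set())
    before: Dict[str, list] = {}
    after: Dict[str, list] = {}
    for r in requests:
        if r.app_id not in watched:
            continue  # app doesn't use the component being rolled out
        bucket = before if r.timestamp < rollout_time else after
        bucket.setdefault(r.app_id, []).append(r)

    flagged: Dict[str, float] = {}
    for app_id in sorted(watched):
        delta = error_rate(after.get(app_id, [])) - error_rate(before.get(app_id, []))
        if delta >= min_increase:
            flagged[app_id] = delta
    return flagged
```

The point is that the comparison is per app and per component, so a
small app doing something unusual shows up on its own instead of
disappearing into the platform-wide average.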

Any other ideas?  I really like GAE and I really like the PaaS
concept.  But reliability is really a problem.  It's probably going to
be an even bigger problem in the future as GAE (hopefully)
adds new features and gets a bigger footprint.  More moving parts
means more failures.

Jeff

P.S. Paying $6k/yr for Premier Support is not the answer.  Whether or
not that would solve my problem, it doesn't solve GAE's problem.

   [1]: http://blorn.com/post/29851770158/beware-cutesy-two-letter-tlds-for-your-domain-name
