Jeff,

these are good ideas and suggestions.  we are working on a number of 
different strategies to ameliorate these issues.  some of the items you are 
suggesting are already in progress, and others besides.  and i agree that 
this is a general philosophical challenge with PaaS.  on GAE we now 
regularly serve several hundred thousand applications, so it is 
indeed a challenge to handle the "long tail" problem.  we are aware of 
this, and you should expect us to be rolling out a number of things to 
address it.  in fact, we expect to make our experience of running this 
large workload over a long period of time into an advantage with GAE. 

Peter S Magnusson
(GAE Eng Dir)


On Wednesday, September 12, 2012 5:35:39 PM UTC-7, Jeff Schnitzer wrote:
>
> On Wed, Sep 12, 2012 at 3:12 PM, Kaan Soral <[email protected]> wrote: 
> > This is why I love App Engine, when a problem occurs instead of having a 
> > heart attack or committing suicide, you can just wait for it to be 
> > resolved. 
>
> Hmmm.  This really unfortunately timed incident may have cost us an 
> important client, so I'm not feeling the love. 
>
> I have quite a lot of experience building and running large online 
> systems prior to embracing GAE and my products have never had as much 
> downtime as I've had over the last year.  It hasn't always been 
> Google's fault (the entire .st registry going down for 8+ hours really 
> sucked[1]) but it usually has been.  See: 
>
>  * Instance startup time ballooning by 3X and hitting deadlines 
> (multiple occasions) 
>  * GAE blocking CloudFlare with an undocumented security system 
>  * This incident, where Java instances started mysteriously failing 
>
> Would waiting have fixed these issues?  I'm not convinced.  Google may 
> have smart people running GAE but they aren't watching _my_ app, 
> they're just watching for an uptick in the number of complaints.  If 
> you're doing something slightly unusual (say, running a CF reverse 
> proxy), you might be statistical noise.  Apparently this Java problem 
> _was_ widespread, but I had no way of knowing that. 
>
> GAE's value proposition is that it's better to have Google's smart 
> engineers building and maintaining your infrastructure.  But my site 
> would be more reliable if I had one dumb person (possibly me) who 
> cares specifically about _my_ infrastructure.  I've screwed up 
> deployments and upgrades in production before, but at least I'm aware 
> when changes happen, get immediate feedback, and can fix the problem 
> right then and there. 
>
> With GAE, the only thing I can do when my alarms go off is to whine as 
> loudly as possible.  But there is no feedback!  I have no way of 
> knowing if Google is working on the problem or if they're still 
> waiting for more complaints that will never materialize.  Will I be 
> down for 15 minutes, 1 hour, 2 hours, 8 hours, forever?  How long do 
> you want to wait? 
>
> This feels like a fundamental flaw in the PaaS concept, destined to 
> produce multiple-hour downtimes at irregular intervals.  The feedback 
> loop is too slow (and lossy if the problem is not widespread). 
> There's no amount of QA or testing that will prevent failures in a 
> system as big and complicated as GAE.  So the only reasonable option is 
> to get that feedback loop shorter.  How can that happen?  Some ideas: 
>
>  * Google could announce when they are rolling out changes.  I don't 
> need release notes (although it would be nice to know what to watch 
> for) but I'd like to know when I should pay extra attention.  Or not 
> schedule client demos.  Facebook does something like this, rolling out 
> platform changes on specific days of the week (which I long ago 
> stopped caring about). 
>
>  * Google could make extra support channels available during this 
> time.  Hell, use twitter.  Think of us as your QA staff - if we see 
> something amiss, we'd like to let you know. 
>
>  * Google could be more transparent about problems as they happen. 
> When you know there is an issue, let us know.  Since I must assume 
> that any problem which Google hasn't acknowledged is a problem Google 
> doesn't know about, I can stop spamming @google.com addresses. 
>
>  * Google could monitor our apps, and compare error rates before 
> rollout to error rates after rollout.  Ideally you'd break this down 
> by component; figure out which apps use the search api, so when you 
> roll out changes to the search system, you're specifically watching 
> for an uptick in 500 errors from those apps.  Something like that. 
>
> Any other ideas?  I really like GAE and I really like the PaaS 
> concept.  But reliability is really a problem.  It's probably going to 
> be an even bigger problem going forward as GAE (hopefully) 
> adds new features and gets a bigger footprint.  More moving parts 
> means more failures. 
>
> Jeff 
>
> P.S. Paying $6k/yr for Premier Support is not the answer.  Whether or 
> not that would solve my problem, that doesn't solve GAE's problem. 
>
>    [1]: http://blorn.com/post/29851770158/beware-cutesy-two-letter-tlds-for-your-domain-name
>  
>

-- 
You received this message because you are subscribed to the Google Groups 
"Google App Engine" group.
To view this discussion on the web visit 
https://groups.google.com/d/msg/google-appengine/-/J1qn8o1RjtwJ.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/google-appengine?hl=en.