Re: [google-appengine] Re: APP DOWN: App suddenly no longer starts, no code changes

Peter Magnusson Thu, 13 Sep 2012 20:20:12 -0700

point taken.  i agree.


On Thu, Sep 13, 2012 at 12:11 PM, Per <[email protected]> wrote:

> Hi Peter,
>
> I'm sure it's extremely(!) hard to host this many applications reliably
> and at a sane cost. I won't try to make any suggestions about that.  But
> what concerns me is the support situation on App Engine. It shouldn't be
> too hard to monitor a low-bandwidth forum like this, and to provide
> somewhat timely feedback when downtime is being reported by some of the
> most senior users, during a time when you're actually rolling out changes.
>
> I can understand that you don't want to turn this into an official support
> forum for all the dumb questions we see here, and I can understand that you
> want to encourage us to purchase premium accounts. But I feel you're
> hurting your own business by not responding swiftly to posts like these.
> We've been through similar issues with our application, and the support
> situation is really the one thing that stops me from recommending App
> Engine wholeheartedly. I bet I'm not the only one. :) When discussing your
> strategies, it would be great if you could consider "improved
> responsiveness" too.
>
> Cheers,
> Per
>
> On Thursday, September 13, 2012 8:12:35 PM UTC+2, psm wrote:
>>
>> Jeff,
>>
>> these are good ideas and suggestions.  we are working on a number of
>> different strategies to ameliorate these issues.  some of the items you are
>> suggesting are already in progress, and others besides.  and i agree that
>> this is a general philosophical challenge with PaaS.  on GAE we now
>> regularly serve several hundreds of thousands of applications, so it is
>> indeed a challenge to handle the "long tail" problem.  we are aware of
>> this, and you should expect us to be rolling out a number of things to
>> address it.  in fact, we expect to make our experience of running this
>> large workload over a long period of time into an advantage with GAE.
>>
>> Peter S Magnusson
>> (GAE Eng Dir)
>>
>>
>> On Wednesday, September 12, 2012 5:35:39 PM UTC-7, Jeff Schnitzer wrote:
>>>
>>> On Wed, Sep 12, 2012 at 3:12 PM, Kaan Soral <[email protected]> wrote:
>>> > This is why I love App Engine: when a problem occurs, instead of having a
>>> > heart attack or committing suicide, you can just wait for it to be resolved.
>>>
>>> Hmmm.  This really unfortunately timed incident may have cost us an
>>> important client, so I'm not feeling the love.
>>>
>>> I have quite a lot of experience building and running large online
>>> systems prior to embracing GAE and my products have never had as much
>>> downtime as I've had over the last year.  It hasn't always been
>>> Google's fault (the entire .st registry going down for 8+ hours really
>>> sucked[1]) but it usually has been.  See:
>>>
>>>  * Instance startup time ballooning by 3X and hitting deadlines
>>> (multiple occasions)
>>>  * GAE blocking CloudFlare with an undocumented security system
>>>  * This incident, where Java instances started mysteriously failing
>>>
>>> Would waiting have fixed these issues?  I'm not convinced.  Google may
>>> have smart people running GAE but they aren't watching _my_ app,
>>> they're just watching for an uptick in the number of complaints.  If
>>> you're doing something slightly unusual (say, running a CF reverse
>>> proxy), you might be statistical noise.  Apparently this Java problem
>>> _was_ widespread, but I had no way of knowing that.
>>>
>>> GAE's value proposition is that it's better to have Google's smart
>>> engineers building and maintaining your infrastructure.  But my site
>>> would be more reliable if I had one dumb person (possibly me) who
>>> cares specifically about _my_ infrastructure.  I've screwed up
>>> deployments and upgrades in production before, but at least I'm aware
>>> when changes happen, get immediate feedback, and can fix the problem
>>> right then and there.
>>>
>>> With GAE, the only thing I can do when my alarms go off is to whine as
>>> loudly as possible.  But there is no feedback!  I have no way of
>>> knowing if Google is working on the problem or if they're still
>>> waiting for more complaints that will never materialize.  Will I be
>>> down for 15 minutes, 1 hour, 2 hours, 8 hours, forever?  How long do
>>> you want to wait?
>>>
>>> This feels like a fundamental flaw in the PaaS concept, destined to
>>> produce multiple-hour downtimes at irregular intervals.  The feedback
>>> loop is too slow (and lossy if the problem is not widespread).
>>> There's no amount of QA or testing that will prevent failures in a
>>> system as big and complicated as GAE.  So the only reasonable option is
>>> to get that feedback loop shorter.  How can that happen?  Some ideas:
>>>
>>>  * Google could announce when they are rolling out changes.  I don't
>>> need release notes (although it would be nice to know what to watch
>>> for) but I'd like to know when I should pay extra attention, or when
>>> not to schedule client demos.  Facebook does something like this, rolling out
>>> platform changes on specific days of the week (which I long ago
>>> stopped caring about).
>>>
>>>  * Google could make extra support channels available during this
>>> time.  Hell, use twitter.  Think of us as your QA staff - if we see
>>> something amiss, we'd like to let you know.
>>>
>>>  * Google could be more transparent about problems as they happen.
>>> When you know there is an issue, let us know.  Right now I must assume
>>> that any problem Google hasn't acknowledged is a problem Google doesn't
>>> know about; if you acknowledged issues, I could stop spamming
>>> @google.com addresses.
>>>
>>>  * Google could monitor our apps, and compare error rates before
>>> rollout to error rates after rollout.  Ideally you'd break this down
>>> by component; figure out which apps use the search api, so when you
>>> roll out changes to the search system, you're specifically watching
>>> for an uptick in 500 errors from those apps.  Something like that.
>>>
>>> Any other ideas?  I really like GAE and I really like the PaaS
>>> concept.  But reliability is really a problem.  It's probably going to
>>> be an even bigger problem going forward as GAE (hopefully)
>>> adds new features and gets a bigger footprint.  More moving parts
>>> means more failures.
>>>
>>> Jeff
>>>
>>> P.S. Paying $6k/yr for Premier Support is not the answer.  Whether or
>>> not that would solve my problem, it doesn't solve GAE's problem.
>>>
>>>    [1]: http://blorn.com/post/29851770158/beware-cutesy-two-letter-tlds-for-your-domain-name
>>>
>>  --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/google-appengine/-/tmYWDN8r2pUJ.
>
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/google-appengine?hl=en.
>
