Jeff, these are good ideas and suggestions. We are working on a number of strategies to ameliorate these issues; some of the items you suggest are already in progress, along with others you did not mention. I agree that this is a general philosophical challenge with PaaS. On GAE we now regularly serve several hundred thousand applications, so handling the "long tail" problem is indeed a challenge. We are aware of this, and you should expect us to roll out a number of things to address it. In fact, we expect to turn our experience of running this large workload over a long period of time into an advantage for GAE.
Peter S Magnusson (GAE Eng Dir)

On Wednesday, September 12, 2012 5:35:39 PM UTC-7, Jeff Schnitzer wrote:
>
> On Wed, Sep 12, 2012 at 3:12 PM, Kaan Soral <[email protected]> wrote:
> > This is why I love App Engine, when a problem occurs instead of having a heart attack or committing suicide, you can just wait for it to be resolved.
>
> Hmmm. This really unfortunately timed incident may have cost us an important client, so I'm not feeling the love.
>
> I have quite a lot of experience building and running large online systems prior to embracing GAE and my products have never had as much downtime as I've had over the last year. It hasn't always been Google's fault (the entire .st registry going down for 8+ hours really sucked[1]) but it usually has been. See:
>
> * Instance startup time ballooning by 3X and hitting deadlines (multiple occasions)
> * GAE blocking CloudFlare with an undocumented security system
> * This incident, where Java instances started mysteriously failing
>
> Would waiting have fixed these issues? I'm not convinced. Google may have smart people running GAE but they aren't watching _my_ app, they're just watching for an uptick in the number of complaints. If you're doing something slightly unusual (say, running a CF reverse proxy), you might be statistical noise. Apparently this Java problem _was_ widespread, but I had no way of knowing that.
>
> GAE's value proposition is that it's better to have Google's smart engineers building and maintaining your infrastructure. But my site would be more reliable if I had one dumb person (possibly me) who cares specifically about _my_ infrastructure. I've screwed up deployments and upgrades in production before, but at least I'm aware when changes happen, get immediate feedback, and can fix the problem right then and there.
>
> With GAE, the only thing I can do when my alarms go off is to whine as loudly as possible. But there is no feedback!
> I have no way of knowing if Google is working on the problem or if they're still waiting for more complaints that will never materialize. Will I be down for 15 minutes, 1 hour, 2 hours, 8 hours, forever? How long do you want to wait?
>
> This feels like a fundamental flaw in the PaaS concept, destined to produce multiple-hour downtimes at irregular intervals. The feedback loop is too slow (and lossy if the problem is not widespread). There's no amount of QA or testing that will prevent failures in a system as big and complicated as GAE. So the only reasonable option is to get that feedback loop shorter. How can that happen? Some ideas:
>
> * Google could announce when they are rolling out changes. I don't need release notes (although it would be nice to know what to watch for) but I'd like to know when I should pay extra attention. Or not schedule client demos. Facebook does something like this, rolling out platform changes on specific days of the week (which I long ago stopped caring about).
>
> * Google could make extra support channels available during this time. Hell, use twitter. Think of us as your QA staff - if we see something amiss, we'd like to let you know.
>
> * Google could be more transparent about problems as they happen. When you know there is an issue, let us know. Since I must assume that any problem which Google hasn't acknowledged is a problem Google doesn't know about, I can stop spamming @google.com addresses.
>
> * Google could monitor our apps, and compare error rates before rollout to error rates after rollout. Ideally you'd break this down by component; figure out which apps use the search api, so when you roll out changes to the search system, you're specifically watching for an uptick in 500 errors from those apps. Something like that.
>
> Any other ideas? I really like GAE and I really like the PaaS concept. But reliability is really a problem.
> It's probably going to be an even bigger problem going on into the future as GAE (hopefully) adds new features and gets a bigger footprint. More moving parts means more failures.
>
> Jeff
>
> P.S. Paying $6k/yr for Premier Support is not the answer. Whether or not that would solve my problem, that doesn't solve GAE's problem.
>
> [1]: http://blorn.com/post/29851770158/beware-cutesy-two-letter-tlds-for-your-domain-name
