Thanks, Alex, that was a great post. Very much appreciate your time in putting that together. To sum it up, we have this architectural pattern (and patterns are just distilled versions of many experiences, so we get a great start when we use them, but it is up to us to build out the last mile) for doing 'very gradual' roll-outs, and we need to consider how to use App Engine Traffic Splitting as a tool in this larger process/pattern.
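
For what it's worth, here is a minimal sketch of what the decision step of such a 'very gradual' roll-out might look like in Python. Everything named here (`RAMP_STEPS`, `VersionStats`, `next_share`, and all the thresholds) is hypothetical and not part of any App Engine API; the step that would actually apply the returned share through Traffic Splitting is deliberately left out, and resource consumption would need a similar check.

```python
from dataclasses import dataclass

# Hypothetical ramp schedule; the percentages are illustrative only.
RAMP_STEPS = [0.05, 0.10, 0.20, 0.40, 0.60, 1.00]


@dataclass
class VersionStats:
    """Metrics observed for one version over the same time window."""
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def next_share(current_share, baseline, candidate,
               min_requests=1000, max_error_ratio=1.5, max_latency_ratio=1.5):
    """Return the traffic share the new version should get next.

    0.0 means "roll back"; an unchanged share means "keep observing".
    """
    # Too few queries for a representative sample: hold at the current share.
    if candidate.requests < min_requests:
        return current_share

    # Roll back if the new version is clearly worse than the existing one.
    # The 0.001 floor avoids a zero error budget when the baseline is error-free.
    error_budget = max(baseline.error_rate * max_error_ratio, 0.001)
    if candidate.error_rate > error_budget:
        return 0.0
    if candidate.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return 0.0

    # Otherwise move exactly one step up the schedule (no shortcuts).
    for step in RAMP_STEPS:
        if step > current_share:
            return step
    return 1.0
```

The function only ever advances one step at a time, so a version that looked fine at 5% still has to earn its way through every later step before it takes all traffic.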
Thanks again!
Hardwick

On Sun, Jul 26, 2015 at 3:45 PM, Alex Martelli <[email protected]> wrote:

> On Fri, Jul 24, 2015 at 1:14 PM, Michael Spainhower <[email protected]>
> wrote:
>
>> Sure, I don't think David or I have any question of whether canary
>> testing is a possible or intended use case of traffic splitting. I am
>> interested in how folks have implemented it in practice for production apps.
>>
>
> I've done it in the past for some internal apps (on internal
> infrastructure, not app engine, but the architectural patterns are pretty
> much the same).
>
> Rather than just "canary testing", I preferred to frame things as a "very
> gradual roll-out". Have the new version of whatever micro-service X we're
> updating start by handling, say, 5% of new incoming queries (depending on
> the overall QPS to your app, you may need to tweak that 5% -- you do need
> the new version to be handling enough queries to give you a statistically
> representative sample, but few enough that if a failure mode is revealed it
> will only inconvenience not-too-many customers... a delicate balance to
> draw).
>
> Monitor the new version in depth, not just for "health" (though that's of
> course crucial), but also resource consumption and latency in response for
> the new version compared to the existing one -- if it's a pretty "core"
> micro-service, doubling latency, or doubling the consumption of some
> constrained resource, may not be something you can afford -- probably
> better to roll back and return to the drawing board to find out what's
> happening and (one hopes:-) fix it. ("peripheral" micro-services, e.g ones
> only occasionally used, may give you more latitude in what extra latency
> and/or resource consumption you can tolerate).
>
> If everything is fine at e.g 5%, then move to 10% -- rinse, repeat.
> Almost always, a new micro-service version that's fine at 5% will also be
> fine at 10%, 20%, etc, as well -- but, "almost" is not quite good enough for a
> mission-critical production app. E.g, there may be a rare but occasionally
> occurring "query of death" for the new version, specifically tickling some
> bug that's usually dormant there -- and you may not meet any occurrence of
> the QoD at, say, 40%, but you might occasionally see some at 60%.
>
> Which is what drilled into me the "no shortcuts" stance -- "hope is not a
> strategy" and all that. I'd rather take several days to complete the
> roll-out (and, if I'm the team's manager or lead, play "lightning rod" to
> protect the rest of the team from pressure by stakeholders) and do so with
> complete confidence, than risk emergencies surging and hurting users'
> workloads -- "think of the user, and everything else follows" is a mantra I
> have long lived by.
>
> So this is just one (though important!) of the architecture patterns that
> micro-services bring to the fore -- it would also exist for a monolithic
> app, just not quite so prominent and without the many bells and whistles
> micro-services suggest (e.g, the ability for any app component to back down
> to a "known good" version of a micro-service it consumes, if and when it
> can detect -- or strongly suspect, e.g by timeouts -- that it's being
> served by a new but alas defective version of some micro-service or
> other... with plenty of logging and pagers ringing of course, but, that
> goes without saying to anybody who knows what Devops' all about:-).
>
> I'm just back from OSCON, and micro-services were all over the place (I
> even had a couple of slides mentioning them, very much in passing alas!, as
> part of my own "Modern Python patterns and idioms" talk there:-) -- but I
> was seriously disappointed by not seeing any of these key architectural
> patterns explored in depth...
>
> It's as if every one of these talks was "μS 101" with some specific twist
> wrt language and/or underlying platform, each interesting, mind you!, but,
> none of the ones I got diving deep enough into the core architecture
> patterns (only weakly dependent on platform and language) that we all need
> to learn and refine as the new architectures emerge.
>
> An opportunity to submit appropriate talks for the *next* OSCON I guess
> (Austin, TX, May 2016)...:-).
>
>> For example, I don't have a great solution for canary testing a version
>> which changes the ndb model schema. I would love to hear the concrete
>> lessons learned from anyone who has done such a thing.
>
> Alas, schema changes are always a bother, no matter the underlying
> technologies and architectures. I can't offer more than applause to David's
> post below, for starting to highlight some of the relevant issues -- TL;DR,
> that schema changes must be done in an incremental way, always having the
> code that wants/prefers/supports schema version N to operate
> non-destructively on versions N-1 *and* N+1 as well. A bother indeed,
> but, I have no silver bullet to slay that particular werewolf, sorry.
>
> This isn't limited to canarying or incremental roll-out, though those
> patterns highlight the problem in particularly stark colors. But even back
> in the times of big-bang upgrades and hours of downtime to let them happen,
> I think I've witnessed more release/upgrade disasters tied to schema
> changes, than to any other single root cause... the risks are just more
> obvious and blocking today, rather than deeply hidden, and that extra
> visibility need not be a bad thing, in fact.
>
>> Another example is how do you elegantly synchronize decoupled apps? What
>> I mean is that e.g., we run our APIs in a different project than our web
>> front-end. There are several ways to handle this, but again would love to
>> get war stories from anyone who has run something similar in production.
>>
>
> I may be missing something here -- I'd expect the web front-end to be a
> consumer of the APIs just like any other front-end would (an excellent
> architectural separation), so e.g a new API version would be handled by the
> web front-end just as it would by any other client (mobile native apps,
> etc, etc) -- use explicit versioning, version negotiation, and so forth --
> just general best-practice patterns of API architecture, no?
>
> I'm sure there are other use cases that show your point better, so, let's
> please discuss them!
>
>
> Alex
>
>>
>> On Friday, July 24, 2015 at 3:56:27 PM UTC-4, Jason Collins wrote:
>>>
>>> Traffic-splitting / canary releases on App Engine are definitely "a
>>> thing".
>>>
>>> Traffic-splits on non-default modules are now available via API:
>>>
>>> https://cloud.google.com/appengine/docs/admin-api/quickstart/#splitting_traffic
>>>
>>>
>>> On Friday, 24 July 2015 11:50:33 UTC-7, Michael Spainhower wrote:
>>>>
>>>> @David, I started following this thread because I have the exact same
>>>> question and agree the lack of response is worrisome.
>>>>
>>>> We are in the Cloud Startup Program and I plan to ask about canary
>>>> testing during my next engineering 1-on-1. I will reply to this thread
>>>> with what I learn from their engineer.
>>>>
>>>>
>>>> On Friday, July 24, 2015 at 9:14:02 AM UTC-4, David Hardwick wrote:
>>>>>
>>>>> Oh boy, the lack of response here is not encouraging
>>>>>
>>>>> On Tuesday, July 21, 2015 at 1:53:17 PM UTC-4, David Hardwick wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> We haven't used Traffic Splitting yet but it has been an available
>>>>>> feature for a while and rumor has it that traffic splitting for
>>>>>> non-default modules could be coming in as soon as a month.
>>>>>>
>>>>>> Any who, if you have experience using it, then I would like to hear how
>>>>>> you are using it to roll out new features or versions.
>>>>>> I've heard the term 'canary' testing where you roll out a new version
>>>>>> to 10% of folks...you measure the results and then either roll back or
>>>>>> fully roll it out. So if anyone is doing 'canary' testing and
>>>>>> deployments as I've described it, then I'd like to hear from you.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Hardwick
>>>>>>
>>>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Google App Engine" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at http://groups.google.com/group/google-appengine.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/google-appengine/d7a1d39c-24c8-48f4-aee3-08a452a75148%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>

--
*David Hardwick* | CTO
http://www.bettercloud.com
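
On the schema point quoted above (code that prefers schema version N should operate non-destructively on versions N-1 and N+1 as well), here is a rough sketch of what that could mean in practice. It uses plain Python dicts rather than ndb models, and the field names (`name`, `first_name`, `last_name`, `schema_version`) and the helpers `load` and `save` are all made up for illustration; a real ndb migration would need more care than this.

```python
SCHEMA_VERSION = 2  # the version this code prefers ("N"); hypothetical


def load(record):
    """Return a version-N view of a stored record without mutating it.

    Assumed history: N-1 stored a single "name" field, N splits it into
    "first_name" and "last_name", and N+1 may add fields unknown to this code.
    """
    view = dict(record)  # copy, so unknown (N+1) fields are carried along
    first, _, last = view.get("name", "").partition(" ")
    view.setdefault("first_name", first)  # only filled in for N-1 records
    view.setdefault("last_name", last)
    return view


def save(record, **changes):
    """Apply changes while keeping the record usable by N-1 and N+1 code."""
    updated = load(record)
    updated.update(changes)
    # Keep the legacy field in sync so N-1 readers still see sensible data.
    updated["name"] = f'{updated["first_name"]} {updated["last_name"]}'.strip()
    # Never downgrade a version stamp written by newer code.
    updated["schema_version"] = max(record.get("schema_version", 1),
                                    SCHEMA_VERSION)
    return updated
```

The main idea is that writes never drop fields the current code does not recognize and never remove fields older code still depends on, so old and new versions can share the same data while a traffic split is in effect.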
