On Fri, Jul 24, 2015 at 1:14 PM, Michael Spainhower <[email protected]>
wrote:
> Sure, I don't think David or I have any question of whether canary testing
> is a possible or intended use case of traffic splitting. I am interested
> in how folks have implemented it in practice for production apps.
>
I've done it in the past for some internal apps (on internal
infrastructure, not App Engine, but the architectural patterns are pretty
much the same).
Rather than just "canary testing", I preferred to frame things as a "very
gradual roll-out". Have the new version of whatever micro-service X we're
updating start by handling, say, 5% of new incoming queries. Depending on
the overall QPS to your app, you may need to tweak that 5%: the new version
must handle enough queries to give you a statistically representative
sample, but few enough that, if a failure mode is revealed, it
inconveniences only a few customers... a delicate balance to strike.
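
A minimal sketch of one way to send a fixed percentage of traffic to the new version, assuming you route on some stable key such as a user id or session cookie. The names `bucket_for`, `pick_version` and `BUCKETS` are hypothetical, made up for illustration; this is not how App Engine's own traffic splitting is implemented.

```python
import hashlib

BUCKETS = 10_000  # resolution of the split: one bucket is 0.01% of keys


def bucket_for(key: str) -> int:
    """Map a stable key (user id, session cookie, ...) to a bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def pick_version(key: str, new_fraction: float) -> str:
    """Return "new" for roughly new_fraction of all keys, "stable" for the rest."""
    threshold = round(new_fraction * BUCKETS)
    return "new" if bucket_for(key) < threshold else "stable"


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, pick_version(user, 0.05))
```

Hashing the key (rather than rolling a die per request) keeps each user on one version, and since the threshold only grows, anybody on the new version at 5% is still on it at 10%.
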
Monitor the new version in depth, not just for "health" (though that's of
course crucial), but also for resource consumption and response latency,
compared to the existing version -- if it's a pretty "core"
micro-service, doubling latency, or doubling the consumption of some
constrained resource, may not be something you can afford -- probably
better to roll back and return to the drawing board to find out what's
happening and (one hopes:-) fix it. ("Peripheral" micro-services, e.g. ones
only occasionally used, may give you more latitude in what extra latency
and/or resource consumption you can tolerate.)
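
The comparison of new vs existing version could be sketched as below. Everything here is an assumption for illustration: the `VersionStats` record, the `canary_problems` helper, and the thresholds (a 1.5x ratio, a 0.1% error-rate delta) are invented, and a "core" service would likely want tighter limits than a "peripheral" one.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List


@dataclass
class VersionStats:
    latencies_ms: List[float]  # one sample per request
    errors: int
    requests: int
    cpu_seconds: float  # total use of some constrained resource


def p95(samples: List[float]) -> float:
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    return quantiles(samples, n=20)[-1]


def canary_problems(stable: VersionStats, canary: VersionStats,
                    max_ratio: float = 1.5,
                    max_error_delta: float = 0.001) -> List[str]:
    """Return the reasons to roll back; an empty list means OK to proceed."""
    problems = []
    error_delta = canary.errors / canary.requests - stable.errors / stable.requests
    if error_delta > max_error_delta:
        problems.append("error rate")
    if p95(canary.latencies_ms) > max_ratio * p95(stable.latencies_ms):
        problems.append("latency")
    per_request_canary = canary.cpu_seconds / canary.requests
    per_request_stable = stable.cpu_seconds / stable.requests
    if per_request_canary > max_ratio * per_request_stable:
        problems.append("resource consumption")
    return problems
```

Resource use is compared per request, since the new version sees far fewer requests than the existing one.
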
If everything is fine at e.g. 5%, then move to 10% -- rinse, repeat. Almost
always, a new micro-service version that's fine at 5% will also be fine at
10%, 20%, etc., as well -- but "almost" is not quite good enough for a
mission-critical production app. E.g., there may be a rare "query of death"
(QoD) for the new version, one that tickles some bug that's usually dormant
there -- and you may not meet any occurrence of the QoD at, say, 40%, but
you might occasionally see some at 60%.
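
To put rough numbers on the QoD point, here is a back-of-the-envelope sketch. It assumes queries are independent and that a fixed fraction of them (`qod_rate`, a made-up parameter) would trigger the bug; real traffic is rarely that tidy. With one QoD per million queries and a million queries in the observation window, the chance of meeting at least one is about 5% at a 5% split and about 45% at a 60% split.

```python
def p_at_least_one_qod(qod_rate: float, total_queries: int,
                       new_fraction: float) -> float:
    """Probability that the new version sees one or more queries of death."""
    exposed = total_queries * new_fraction  # queries routed to the new version
    return 1.0 - (1.0 - qod_rate) ** exposed


if __name__ == "__main__":
    for fraction in (0.05, 0.10, 0.20, 0.40, 0.60, 1.00):
        p = p_at_least_one_qod(1e-6, 1_000_000, fraction)
        print(f"{fraction:4.0%} of traffic: P(at least one QoD) = {p:.3f}")
```
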
Which is what drilled into me the "no shortcuts" stance -- "hope is not a
strategy" and all that. I'd rather take several days to complete the
roll-out (and, if I'm the team's manager or lead, play "lightning rod" to
protect the rest of the team from pressure by stakeholders) and do so with
complete confidence, than risk emergencies surging and hurting users'
workloads -- "think of the user, and everything else follows" is a mantra I
have long lived by.
So this is just one (though important!) of the architecture patterns that
micro-services bring to the fore -- it would also exist for a monolithic
app, just not quite so prominently and without the many bells and whistles
micro-services suggest (e.g., the ability for any app component to back down
to a "known good" version of a micro-service it consumes, if and when it
can detect -- or strongly suspect, e.g. by timeouts -- that it's being
served by a new but alas defective version of some micro-service or
other... with plenty of logging and pagers ringing of course, but that
goes without saying to anybody who knows what DevOps is all about:-).
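
The "back down to a known good version" idea might look something like this on the consumer side. It is only a sketch: `FallbackClient` and its `max_failures` knob are hypothetical, the two callables stand in for whatever client stubs you actually use, and treating `TimeoutError` and `ConnectionError` as "suspect" is an assumption.

```python
import logging

log = logging.getLogger("rollout")


class FallbackClient:
    """Call the new version; on suspect failures, use the known-good one."""

    def __init__(self, new_call, known_good_call, max_failures=3):
        self.new_call = new_call
        self.known_good_call = known_good_call
        self.max_failures = max_failures
        self.failures = 0

    def __call__(self, *args, **kwargs):
        # After max_failures suspect errors, stop trying the new version at all.
        if self.failures < self.max_failures:
            try:
                return self.new_call(*args, **kwargs)
            except (TimeoutError, ConnectionError) as exc:
                self.failures += 1
                # This log line is where the pager would be wired in.
                log.error("new version suspect (%r), failure %d; "
                          "falling back to known-good", exc, self.failures)
        return self.known_good_call(*args, **kwargs)
```

Other exceptions propagate unchanged, so a genuine bug in the caller is not masked by the fallback.
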
I'm just back from OSCON, and micro-services were all over the place (I
even had a couple of slides mentioning them, very much in passing alas!, as
part of my own "Modern Python patterns and idioms" talk there:-) -- but I
was seriously disappointed by not seeing any of these key architectural
patterns explored in depth...
It's as if every one of these talks was "μS 101" with some specific twist
wrt language and/or underlying platform -- each interesting, mind you! --
but none of the ones I got to dove deep enough into the core architecture
patterns (only weakly dependent on platform and language) that we all need
to learn and refine as the new architectures emerge.
An opportunity to submit appropriate talks for the *next* OSCON I guess
(Austin, TX, May 2016)...:-).
> For example, I don't have a great solution for canary testing a version
> which changes the ndb model schema. I would love to hear the concrete
> lessons learned from anyone who has done such a thing.
Alas, schema changes are always a bother, no matter the underlying
technologies and architectures. I can't offer more than applause to David's
post below, for starting to highlight some of the relevant issues -- TL;DR:
schema changes must be done incrementally, so that code written for
schema version N can always operate non-destructively on versions N-1
*and* N+1 as well. A bother indeed, but
I have no silver bullet to slay that particular werewolf, sorry.
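
As a small illustration of "operate non-destructively on N-1 and N+1", here is a sketch using plain dicts rather than ndb models. The field names (`name`, `display_name`, `schema_version`) and the rename between versions are invented for the example; the point is only the shape of the code.

```python
MY_VERSION = 2  # this code "speaks" schema version 2


def load_display_name(record: dict) -> str:
    """Read the name from a v1, v2 or later record."""
    # v1 stored "name"; v2 renamed it to "display_name" (hypothetical change).
    return record.get("display_name", record.get("name", ""))


def update_display_name(record: dict, value: str) -> dict:
    """Return an updated copy without destroying what other versions need."""
    updated = dict(record)  # copy keeps fields this code does not know (N+1)
    updated["display_name"] = value
    if "name" in record:
        # Keep the legacy field in sync so version N-1 code still reads sanely.
        updated["name"] = value
    # Never lower the version stamp written by newer code.
    updated["schema_version"] = max(record.get("schema_version", 1), MY_VERSION)
    return updated
```

The two rules doing the work are: copy through anything you do not understand, and never drop or downgrade what an older or newer version wrote.
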
This isn't limited to canarying or incremental roll-out, though those
patterns highlight the problem in particularly stark colors. But even back
in the times of big-bang upgrades and hours of downtime to let them happen,
I think I've witnessed more release/upgrade disasters tied to schema
changes, than to any other single root cause... the risks are just more
obvious and blocking today, rather than deeply hidden, and that extra
visibility need not be a bad thing, in fact.
> Another example is how do you elegantly synchronize decoupled apps? What I
> mean is that e.g., we run our APIs in a different project than our web
> front-end. There are several ways to handle this, but again would love to
> get war stories from anyone who has run something similar in production.
>
I may be missing something here -- I'd expect the web front-end to be a
consumer of the APIs just like any other front-end would (an excellent
architectural separation), so e.g. a new API version would be handled by the
web front-end just as it would by any other client (mobile native apps,
etc.) -- use explicit versioning, version negotiation, and so forth --
just general best-practice patterns of API architecture, no?
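
For what "version negotiation" can mean at its simplest, here is a sketch in which each side advertises the API versions it supports and both settle on the highest one in common. `negotiate_version` is a hypothetical helper, and plain integer versions are an assumption; a real API might negotiate via URL paths, headers, or media types instead.

```python
from typing import Iterable


def negotiate_version(client_supported: Iterable[int],
                      server_supported: Iterable[int]) -> int:
    """Pick the highest API version that both client and server support."""
    common = set(client_supported) & set(server_supported)
    if not common:
        raise ValueError("no API version in common; client must upgrade "
                         "or server must keep an older version alive")
    return max(common)


if __name__ == "__main__":
    # An older web front-end talking to an API that has already moved on.
    print(negotiate_version([1, 2], [2, 3]))
```
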
I'm sure there are other use cases that show your point better, so, let's
please discuss them!
Alex
>
>
>
> On Friday, July 24, 2015 at 3:56:27 PM UTC-4, Jason Collins wrote:
>>
>> Traffic-splitting / canary releases on App Engine are definitely "a
>> thing".
>>
>> Traffic-splits on non-default modules are now available via API:
>>
>>
>> https://cloud.google.com/appengine/docs/admin-api/quickstart/#splitting_traffic
>>
>>
>> On Friday, 24 July 2015 11:50:33 UTC-7, Michael Spainhower wrote:
>>>
>>> @David, I started following this thread because I have the exact same
>>> question and agree the lack of response is worrisome.
>>>
>>> We are in the Cloud Startup Program and I plan to ask about canary
>>> testing during my next engineering 1-on-1. I will reply to this thread
>>> with what I learn from their engineer.
>>>
>>>
>>>
>>> On Friday, July 24, 2015 at 9:14:02 AM UTC-4, David Hardwick wrote:
>>>>
>>>> Oh boy, the lack of response here is not encouraging
>>>>
>>>> On Tuesday, July 21, 2015 at 1:53:17 PM UTC-4, David Hardwick wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> We haven't used Traffic Splitting yet but it has been an available
>>>>> feature for a while and rumor has it that traffic splitting for
>>>>> non-default
>>>>> modules could be coming in as soon as a month.
>>>>>
>>>>> Anyhow, if you have experience using it, then I would like to hear how
>>>>> you are using it to roll out new features or versions. I've heard the
>>>>> term
>>>>> 'canary' testing where you roll out a new version to 10% of folks...you
>>>>> measure the results and then either roll back or fully roll it out. So if
>>>>> anyone is doing 'canary' testing and deployments as I've described it,
>>>>> then
>>>>> I'd like to hear from you.
>>>>>
>>>>> Thanks in advance,
>>>>> Hardwick
>>>>>
>>>> --
> You received this message because you are subscribed to the Google Groups
> "Google App Engine" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/google-appengine.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/google-appengine/d7a1d39c-24c8-48f4-aee3-08a452a75148%40googlegroups.com
>
> For more options, visit https://groups.google.com/d/optout.
>