On 04/02/14 20:34, Robert Collins wrote:
> On 5 February 2014 13:14, Zane Bitter <zbit...@redhat.com> wrote:
>> That's not a great example, because one DB server depends on the other,
>> forcing them into updating serially anyway.
>>
>> I have to say that even in general, this whole idea about applying update
>> policies to non-grouped resources doesn't make a whole lot of sense to me.
>> For non-grouped resources you control the resource definitions individually
>> - if you don't want them to update at a particular time, you have the option
>> of just not updating them.
>
> Well, I don't particularly like the idea of doing thousands of
> discrete heat stack-update calls, which would seem to be what you're
> proposing.

I'm not proposing you do it by hand if that's any help ;)

Ideally a workflow service would exist that could do the messy parts for you, but at the end of the day it's just a for-loop in your code. From what you say below, I think you started down the path of managing a lot of complexity yourself when you were forced to generate templates for server groups rather than use autoscaling. I think it would be better for _everyone_ if we put resources into helping TripleO get off that path than into making it less inconvenient to stay on it.
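(For illustration, the for-loop in question might look something like this with python-heatclient - HEAT_ENDPOINT, TOKEN and get_updated_template() are placeholders for however you authenticate and generate the new definitions, not anything real:)

    from heatclient.client import Client

    # Placeholder credentials -- substitute your normal auth setup.
    heat = Client('1', endpoint=HEAT_ENDPOINT, token=TOKEN)

    # One stack per server (or per server group): just iterate and update.
    for stack in heat.stacks.list():
        heat.stacks.update(stack.id,
                           template=get_updated_template(stack),
                           parameters={})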

> On groups: autoscale groups are a problem for security-minded
> deployments because every server has identical resources (today) and
> we very much want discrete credentials per server - at least this is
> my understanding of the reason we're not using scaling groups in
> TripleO.

OK, I wasn't aware that y'all are not using scaling groups. It sounds like this is the real problem we should be addressing, because everyone wants secure-minded deployments and nobody wants to have to manually define the configs for their 1000 all-but-identical servers. If we had a mechanism to ensure that every server in a scaling group could obtain its own credentials then it seems to me that the issue of whether to apply autoscaling-style rolling upgrades to manually-defined groups of resources becomes moot.

(Note: if anybody read that paragraph and started thinking "hey, we could make Turing-complete programmable template templates using the JSON equivalent of XSLT", please just stop right now kthx.)
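(To make the scaling-group idea concrete - and note it needs no template-generation magic at all - here's a sketch only, assuming OS::Heat::AutoScalingGroup can scale a provider template as its unit; untested:)

    # server_with_creds.yaml -- the scaled unit. Because the user and
    # key are defined inside the unit, every group member gets its own.
    heat_template_version: 2013-05-23
    parameters:
      image: {type: string}
      flavor: {type: string}
    resources:
      member_user:
        type: AWS::IAM::User
      member_key:
        type: AWS::IAM::AccessKey
        properties:
          UserName: {get_resource: member_user}
      member:
        type: OS::Nova::Server
        properties:
          image: {get_param: image}
          flavor: {get_param: flavor}

    # In the parent template, the group then scales that unit:
    resources:
      group:
        type: OS::Heat::AutoScalingGroup
        properties:
          min_size: 1
          max_size: 1000
          resource:
            type: server_with_creds.yaml
            properties:
              image: {get_param: image}
              flavor: {get_param: flavor}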

>> Where you _do_ need it is for scaling groups where every server is based on
>> the same launch config, so you need a way to control the members
>> individually - by batching up operations (done), adding delays (done) or,
>> even better, notifications and callbacks.
>>
>> So it seems like doing 'rolling' updates for any random subset of resources
>> is effectively turning Heat into something of a poor-man's workflow service,
>> and IMHO that is probably a mistake.

> I meant to reply to the other thread, but here is just as good :) -
> heat as a way to describe the intended state, and heat takes care of
> transitions, is a brilliant model. It absolutely implies a bunch of
> workflows - the AWS update policy is probably the key example.

Absolutely. Orchestration works by building a workflow internally, which Heat then also executes. No disagreement there.
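(For readers following along, the AWS-style update policy being referenced looks roughly like this in a template - a fragment only, give or take the exact property spelling:)

    group:
      type: AWS::AutoScaling::AutoScalingGroup
      properties:
        # ... launch config, size, etc.
      update_policy:
        AutoScalingRollingUpdate:
          MinInstancesInService: 1
          MaxBatchSize: 2
          PauseTime: PT1M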

> Being able to gracefully, *automatically* work through a transition
> between two defined states, allowing the nodes in question to take
> care of their own needs along the way seems like a pretty core
> function to fit inside Heat itself. It's not at all the same as 'allow
> users to define arbitrary workflows'.

That's fair and, I like to think, consistent with what I was suggesting below.

>> What we do need for all resources (not just scaling groups) is a way for the
>> user to say "for this particular resource, notify me when it has updated
>> (but, if possible, before we have taken any destructive actions on it), give
>> me a chance to test it and accept or reject the update". For example, when
>> you resize a server, give the user a chance to confirm or reject the change
>> at the VERIFY_RESIZE step (Trove requires this). Or when you replace a
>> server during an update, give the user a chance to test the new server and
>> either keep it (continue on and delete the old one) or not (roll back). Or
>> when you replace a server in a scaling group, notify the load balancer _or
>> some other thing_ (e.g. OpenShift broker node) that a replacement has been
>> created and wait for it to switch over to the new one before deleting the
>> old one. Or, of course, when you update a server to some new config, give
>> the user a chance to test it out and make sure it works before continuing
>> with the stack update. All of these use cases can, I think, be solved with a
>> single feature.
>>
>> The open questions for me are:
>> 1) How do we notify the user that it's time to check on a resource?
>> (Marconi?)

> This is the graceful update stuff I referred to in my mail to Clint -
> the proposal from hallway discussions in HK was to do this by
> notifying the server itself (that way we don't create a centralised
> point of failure). I can see though that in a general sense not all
> resources are servers. But - how about allowing users to specify where to
> notify (and notifying is always by setting a value in metadata
> somewhere) - users can then pull that out themselves however they want
> to. Adding push notifications is orthogonal IMO - we'd like that for
> all metadata changes, for instance.

TBH I think I would be OK with it only working for servers. I can't think of a non-server case where this would be interesting (though if somebody else can, please speak now). I guess, say, a Trove instance is "server-like", but in that these kinds of notifications should be handled under the hood by Trove (using a Heat template containing an actual server) and not by the user anyway.
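(To make the metadata proposal concrete, here's a sketch of the in-instance side - assuming the server has credentials to read its own resource metadata; the 'action_required' key and the quiesce/signal helpers are made up for illustration, not an agreed interface:)

    import time
    from heatclient.client import Client

    heat = Client('1', endpoint=HEAT_ENDPOINT, token=TOKEN)

    while True:
        # Poll this server's resource metadata for the (hypothetical)
        # flag that Heat would set when an update is pending.
        md = heat.resources.metadata(STACK_ID, 'my_server')
        if md.get('action_required'):
            quiesce()      # placeholder: drain connections, etc.
            signal_ack()   # placeholder: ack via a WaitCondition
        time.sleep(30)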

The idea of sending the notification to the server itself is an interesting one, and I can see it working well in many cases (particularly for a completely user-controlled client-server system like e.g. OpenShift). However, I can imagine some potential issues, especially in what I expect to be the most common case of a bunch of app servers behind a load balancer:

* Do we want to force the user to have an in-instance agent for managing the load balancer (I realise this is fine for TripleO, but in general...)?
* What if you're replacing a server and the old one is already dead for some reason? How do you get it out of the load balancer then?
* Do we want the server to contain credentials that would allow it to manipulate the load balancer?

Maybe the answer is that we continue to special-case load balancers, but this is already causing us problems (since the ecosystem has not yet standardised on the Neutron LBaaS) and it would be really nice if we could use a single mechanism for everything.

>> 2) How does the user ack/nack? (You're suggesting reusing WaitCondition, and
>> that makes sense to me.)

> The server would use a WaitCondition, yes.
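(Something like this, presumably - a sketch assuming the AWS-style resources; the ack/nack is a JSON document PUT to the handle's pre-signed URL, which is what cfn-signal does under the hood:)

    update_handle:
      type: AWS::CloudFormation::WaitConditionHandle
    update_wait:
      type: AWS::CloudFormation::WaitCondition
      properties:
        Handle: {get_resource: update_handle}
        Timeout: 600

    # ack from the server (nack by sending "FAILURE" instead):
    curl -X PUT -H 'Content-Type: application/json' \
         --data-binary '{"Status": "SUCCESS", "Reason": "verified", "UniqueId": "1", "Data": "new server looks good"}' \
         "$WAIT_HANDLE_URL"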

>> 3) How do we break up the operations so the notification occurs at the right
>> time? (With difficulty, but it should be do-able.)

> Just wrap the existing operations - if <should notify> then:
> notify-wait-do, otherwise just do.

Yeah, this is not an important discussion to have right here and now. Suffice it to say that there are some subtleties - e.g. the VERIFY_RESIZE thing happens in the _middle_ of an existing operation, so just wrapping the existing operations is, sadly, not sufficient. Something worth keeping in mind when it comes to implementation time.
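(In outline, with illustrative names rather than the real Heat internals:)

    def run_update_action(resource, do_action):
        # "notify-wait-do" wrapper: flag the pending change, wait for
        # the user's ack, then perform the operation as before.
        if resource.should_notify():
            resource.notify_pending_update()  # e.g. set a metadata flag
            resource.wait_for_ack()           # e.g. via a WaitCondition
        do_action()

    # The resize case doesn't fit this shape: the ack point is at
    # VERIFY_RESIZE, *inside* do_action(), so the hook has to be
    # threaded into the operation rather than wrapped around it.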

>> 4) How does the user indicate for which resources they want to be notified?
>> (Inside an update_policy? Another new directive at the
>> type/properties/depends_on/update_policy level?)

> I would say per resource.

I agree (and, to be clear, all of the options listed were intended to be per-resource), but the exact syntax remains an open question.
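(Something like one of these, syntax entirely hypothetical:)

    my_server:
      type: OS::Nova::Server
      properties:
        # ...
      update_policy:            # option A: a key inside update_policy
        notify_on_update: true
      # or:
      # notify_on_update: true  # option B: a new resource-level directive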

cheers,
Zane.
