On 04/02/14 20:34, Robert Collins wrote:
> On 5 February 2014 13:14, Zane Bitter <zbit...@redhat.com> wrote:
>> That's not a great example, because one DB server depends on the other,
>> forcing them into updating serially anyway.
>>
>> I have to say that even in general, this whole idea about applying update
>> policies to non-grouped resources doesn't make a whole lot of sense to me.
>> For non-grouped resources you control the resource definitions individually
>> - if you don't want them to update at a particular time, you have the option
>> of just not updating them.
>
> Well, I don't particularly like the idea of doing thousands of
> discrete heat stack-update calls, which would seem to be what you're
> proposing.

I'm not proposing you do it by hand if that's any help ;)

Ideally a workflow service would exist that could do the messy parts for you, but at the end of the day it's just a for-loop in your code. From what you say below, I think you started down the path of managing a lot of complexity yourself when you were forced to generate templates for server groups rather than use autoscaling. I think it would be better for _everyone_ if we put resources into helping TripleO get off that path than into making it less inconvenient to stay on it.
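(For illustration, the for-loop in question might look something like this with python-heatclient - HEAT_ENDPOINT, TOKEN and get_updated_template() are placeholders for however you authenticate and generate the new definitions, not anything real:)

    from heatclient.client import Client

    # Placeholder credentials -- substitute your normal auth setup.
    heat = Client('1', endpoint=HEAT_ENDPOINT, token=TOKEN)

    # One stack per server (or per server group): just iterate and update.
    for stack in heat.stacks.list():
        heat.stacks.update(stack.id,
                           template=get_updated_template(stack),
                           parameters={})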

> On groups: autoscale groups are a problem for security-minded
> deployments because every server has identical resources (today) and
> we very much want discrete credentials per server - at least this is
> my understanding of the reason we're not using scaling groups in
> TripleO.

OK, I wasn't aware that y'all are not using scaling groups. It sounds like this is the real problem we should be addressing, because everyone wants secure-minded deployments and nobody wants to have to manually define the configs for their 1000 all-but-identical servers. If we had a mechanism to ensure that every server in a scaling group could obtain its own credentials then it seems to me that the issue of whether to apply autoscaling-style rolling upgrades to manually-defined groups of resources becomes moot.

(Note: if anybody read that paragraph and started thinking "hey, we could make Turing-complete programmable template templates using the JSON equivalent of XSLT", please just stop right now kthx.)
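(To make the scaling-group idea concrete - and note it needs no template-generation magic at all - here's a sketch only, assuming OS::Heat::AutoScalingGroup can scale a provider template as its unit; untested:)

    # server_with_creds.yaml -- the scaled unit. Because the user and
    # key are defined inside the unit, every group member gets its own.
    heat_template_version: 2013-05-23
    parameters:
      image: {type: string}
      flavor: {type: string}
    resources:
      member_user:
        type: AWS::IAM::User
      member_key:
        type: AWS::IAM::AccessKey
        properties:
          UserName: {get_resource: member_user}
      member:
        type: OS::Nova::Server
        properties:
          image: {get_param: image}
          flavor: {get_param: flavor}

    # In the parent template, the group then scales that unit:
    resources:
      group:
        type: OS::Heat::AutoScalingGroup
        properties:
          min_size: 1
          max_size: 1000
          resource:
            type: server_with_creds.yaml
            properties:
              image: {get_param: image}
              flavor: {get_param: flavor}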

>> Where you _do_ need it is for scaling groups where every server is based on
>> the same launch config, so you need a way to control the members
>> individually - by batching up operations (done), adding delays (done) or,
>> even better, notifications and callbacks.
>>
>> So it seems like doing 'rolling' updates for any random subset of resources
>> is effectively turning Heat into something of a poor-man's workflow service,
>> and IMHO that is probably a mistake.

> I meant to reply to the other thread, but here is just as good :) -
> heat as a way to describe the intended state, and heat takes care of
> transitions, is a brilliant model. It absolutely implies a bunch of
> workflows - the AWS update policy is probably the key example.

Absolutely. Orchestration works by building a workflow internally, which Heat then also executes. No disagreement there.
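(For readers following along, the AWS-style update policy being referenced looks roughly like this in a template - a fragment only, give or take the exact property spelling:)

    group:
      type: AWS::AutoScaling::AutoScalingGroup
      properties:
        # ... launch config, size, etc.
      update_policy:
        AutoScalingRollingUpdate:
          MinInstancesInService: 1
          MaxBatchSize: 2
          PauseTime: PT1M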

> Being able to gracefully, *automatically* work through a transition
> between two defined states, allowing the nodes in question to take
> care of their own needs along the way seems like a pretty core
> function to fit inside Heat itself. It's not at all the same as 'allow
> users to define arbitrary workflows'.

That's fair and, I like to think, consistent with what I was suggesting below.

>> What we do need for all resources (not just scaling groups) is a way for the
>> user to say "for this particular resource, notify me when it has updated
>> (but, if possible, before we have taken any destructive actions on it), give
>> me a chance to test it and accept or reject the update". For example, when
>> you resize a server, give the user a chance to confirm or reject the change
>> at the VERIFY_RESIZE step (Trove requires this). Or when you replace a
>> server during an update, give the user a chance to test the new server and
>> either keep it (continue on and delete the old one) or not (roll back). Or
>> when you replace a server in a scaling group, notify the load balancer _or
>> some other thing_ (e.g. OpenShift broker node) that a replacement has been
>> created and wait for it to switch over to the new one before deleting the
>> old one. Or, of course, when you update a server to some new config, give
>> the user a chance to test it out and make sure it works before continuing
>> with the stack update. All of these use cases can, I think, be solved with a
>> single feature.
>>
>> The open questions for me are:
>> 1) How do we notify the user that it's time to check on a resource?
>> (Marconi?)

> This is the graceful update stuff I referred to in my mail to Clint -
> the proposal from hallway discussions in HK was to do this by
> notifying the server itself (that way we don't create a centralised
> point of failure). I can see though that in a general sense not all
> resources are servers. But - how about allowing users to specify where to
> notify (and notifying is always by setting a value in metadata
> somewhere) - users can then pull that out themselves however they want
> to. Adding push notifications is orthogonal IMO - we'd like that for
> all metadata changes, for instance.

TBH I think I would be OK with it only working for servers. I can't think of a non-server case where this would be interesting (though if somebody else can, please speak now). I guess, say, a Trove instance is "server-like", but in that these kinds of notifications should be handled under the hood by Trove (using a Heat template containing an actual server) and not by the user anyway.
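(To make the metadata proposal concrete, here's a sketch of the in-instance side - assuming the server has credentials to read its own resource metadata; the 'action_required' key and the quiesce/signal helpers are made up for illustration, not an agreed interface:)

    import time
    from heatclient.client import Client

    heat = Client('1', endpoint=HEAT_ENDPOINT, token=TOKEN)

    while True:
        # Poll this server's resource metadata for the (hypothetical)
        # flag that Heat would set when an update is pending.
        md = heat.resources.metadata(STACK_ID, 'my_server')
        if md.get('action_required'):
            quiesce()      # placeholder: drain connections, etc.
            signal_ack()   # placeholder: ack via a WaitCondition
        time.sleep(30)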

The idea of sending the notification to the server itself is an interesting one, and I can see it working well in many cases (particularly for a completely user-controlled client-server system like e.g. OpenShift). However, I can imagine some potential issues, especially in what I expect to be the most common case of a bunch of app servers behind a load balancer:

* Do we want to force the user to have an in-instance agent for managing the load balancer (I realise this is fine for TripleO, but in general...)?
* What if you're replacing a server and the old one is already dead for some reason? How do you get it out of the load balancer then?
* Do we want the server to contain credentials that would allow it to manipulate the load balancer?

Maybe the answer is that we continue to special-case load balancers, but this is already causing us problems (since the ecosystem has not yet standardised on the Neutron LBaaS) and it would be really nice if we could use a single mechanism for everything.

>> 2) How does the user ack/nack? (You're suggesting reusing WaitCondition, and
>> that makes sense to me.)

> The server would use a WaitCondition, yes.
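(Something like this, presumably - a sketch assuming the AWS-style resources; the ack/nack is a JSON document PUT to the handle's pre-signed URL, which is what cfn-signal does under the hood:)

    update_handle:
      type: AWS::CloudFormation::WaitConditionHandle
    update_wait:
      type: AWS::CloudFormation::WaitCondition
      properties:
        Handle: {get_resource: update_handle}
        Timeout: 600

    # ack from the server (nack by sending "FAILURE" instead):
    curl -X PUT -H 'Content-Type: application/json' \
         --data-binary '{"Status": "SUCCESS", "Reason": "verified", "UniqueId": "1", "Data": "new server looks good"}' \
         "$WAIT_HANDLE_URL"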

>> 3) How do we break up the operations so the notification occurs at the right
>> time? (With difficulty, but it should be do-able.)

> Just wrap the existing operations - if <should notify> then:
> notify-wait-do, otherwise just do.

Yeah, this is not an important discussion to have right here and now. Suffice it to say that there are some subtleties - e.g. the VERIFY_RESIZE thing happens in the _middle_ of an existing operation, so just wrapping the existing operations is, sadly, not sufficient. Something worth keeping in mind when it comes to implementation time.
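(In outline, with illustrative names rather than the real Heat internals:)

    def run_update_action(resource, do_action):
        # "notify-wait-do" wrapper: flag the pending change, wait for
        # the user's ack, then perform the operation as before.
        if resource.should_notify():
            resource.notify_pending_update()  # e.g. set a metadata flag
            resource.wait_for_ack()           # e.g. via a WaitCondition
        do_action()

    # The resize case doesn't fit this shape: the ack point is at
    # VERIFY_RESIZE, *inside* do_action(), so the hook has to be
    # threaded into the operation rather than wrapped around it.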

>> 4) How does the user indicate for which resources they want to be notified?
>> (Inside an update_policy? Another new directive at the
>> type/properties/depends_on/update_policy level?)

> I would say per resource.

I agree (and, to be clear, all of the options listed were intended to be per-resource), but the exact syntax remains an open question.
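(Something like one of these, syntax entirely hypothetical:)

    my_server:
      type: OS::Nova::Server
      properties:
        # ...
      update_policy:            # option A: a key inside update_policy
        notify_on_update: true
      # or:
      # notify_on_update: true  # option B: a new resource-level directive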

cheers,
Zane.
