Qiming Teng <teng...@linux.vnet.ibm.com> wrote on 07/02/2014 03:02:14 AM:
> Just some random thoughts below ...
>
> On Tue, Jul 01, 2014 at 03:47:03PM -0400, Mike Spreitzer wrote:
> > ...
> > I have not found design discussion of this; have I missed something?
> >
> > I suppose the natural answer for OpenStack would be centered around
> > webhooks...
>
> Well, I would suggest we generalize this into an event messaging or
> signaling solution, instead of just 'webhooks'. The reason is that
> webhooks as implemented today do not carry a payload of useful
> information -- I'm referring to the alarms in Ceilometer.

OK, this is great (and Steve Hardy provided more details in his reply); I
did not know about the existing abilities to carry a payload. However,
Ceilometer alarms are still deficient in that way, right? A Ceilometer
alarm's action list is simply a list of URLs, right? I would be happy to
say let's generalize Ceilometer alarms to allow a payload in an action.

> There are other cases as well. A member failure could be caused by a
> temporary communication problem, which means the member may reappear
> quickly, when a replacement is already being created. It may mean that
> we have to respond to an 'online' event in addition to an 'offline'
> event?
> ...
> The problem here today is about the recovery of an SG member. If it is
> a compute instance, we can 'reboot', 'rebuild', 'evacuate', or
> 'migrate' it, just to name a few options. The most brutal way to do
> this is what HARestarter does today -- a delete followed by a create.

We could get into arbitrary subtlety, and maybe eventually will do
better, but I think we can start with a simple solution that is widely
applicable.
The simple solution is this: once the decision has been made to do
convergence on a member (note that this is distinct from merely detecting
and noting a divergence), it is carried out regardless of whether the
doomed member later appears to have recovered, and the convergence action
for a scaling group member is to delete the old member and create a
replacement (not in that order).

> > When the member is a nested stack and Ceilometer exists, it could be
> > the member stack's responsibility to include a Ceilometer alarm that
> > detects the member stack's death and hits the member stack's deletion
> > webhook.
>
> This is difficult. A '(nested) stack' is a Heat-specific abstraction --
> recall that we have to annotate a nova server resource in its metadata
> to record which stack the server belongs to. Besides the 'visible'
> resources specified in a template, Heat may create internal data
> structures and/or resources (e.g. users) for a stack. I am not quite
> sure a stack's death can be easily detected from outside Heat. It would
> be at least cumbersome to have Heat notify Ceilometer that a stack is
> dead, and then have Ceilometer send back a signal.

A (nested) stack is not only a Heat-specific abstraction; its semantics
and failure modes are specific to the stack (at least, to its template).
I think we have no practical choice but to let the template author
declare how failure is detected. It could be as simple as creating
Ceilometer alarms that detect the death of one or more resources in the
nested stack; it could be more complicated Ceilometer stuff; it could be
based on something other than, or in addition to, Ceilometer. If today
there are not enough sensors to detect failures of all kinds of
resources, I consider that a gap in telemetry (and think it is small
enough that we can proceed usefully today, and should plan on filling
that gap over time).
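To make that concrete, here is a rough sketch of what I mean by "the
template author declares how failure is detected" (all resource and
parameter names below are hypothetical, and the choice of meter and
thresholds is just an example, not a recommendation): the member template
declares its own death detector and wires it to a deletion webhook passed
in as a parameter.

```yaml
heat_template_version: 2013-05-23

parameters:
  member_delete_hook:
    type: string
    description: URL to hit when this member is judged dead

resources:
  # The template author decides what "failure" means for this member.
  # Here it is "fewer than one cpu_util sample in each of two
  # consecutive 60-second periods", i.e. the member stopped reporting.
  death_alarm:
    type: OS::Ceilometer::Alarm
    properties:
      meter_name: cpu_util
      statistic: count
      period: 60
      evaluation_periods: 2
      threshold: 1
      comparison_operator: lt
      alarm_actions:
        - {get_param: member_delete_hook}
```

A different member template could define failure entirely differently --
that is the point of leaving the declaration to the author.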
> > There is a small matter of how the author of the template used to
> > create the member stack writes some template snippet that creates a
> > Ceilometer alarm specific to a member stack that does not exist yet.
>
> How about just one signal responder per ScalingGroup? An SG is supposed
> to be in a better position to make the judgement: do I have to recreate
> a failed member? Am I recreating it right now, or waiting a few
> seconds? Maybe I should recreate the member in some specific AZs?

That is conflating two issues. The thing that is new here is making the
scaling group recognize member failure; the primary reaction is to update
its accounting of members (which, in the current code, must be done by
making sure the failed member is deleted). Recovery of the other scaling
group aspects is fairly old hat; it is analogous to the problems the
scaling group already solves when asked to increase its size.

> ...
> > I suppose we could stipulate that if the member template includes a
> > parameter with name "member_name" and type "string" then the SG takes
> > care of supplying the correct value of that parameter; as illustrated
> > in the asg_of_stacks.yaml of https://review.openstack.org/#/c/97366/ ,
> > a member template can use a template parameter to tag Ceilometer data
> > for querying. The URL of the member stack's deletion webhook could be
> > passed to the member template via the same sort of convention.
>
> I am not in favor of the per-member webhook design. But I vote for an
> additional *implicit* parameter to a nested stack of any group. It
> could be an index or a name.

Right, I was elaborating on a particular formulation of "implicit
parameter". In particular, I suggested an "implicit parameter value" for
an optional explicit parameter.
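Concretely, the convention I had in mind looks something like the
following. Only the "member_name" parameter name comes from the
stipulation above; the alarm itself and the metadata key are a
hypothetical sketch of how a member template might use the supplied
value to tag and query Ceilometer data.

```yaml
parameters:
  member_name:
    type: string
    # Under the proposed convention the scaling group supplies this
    # value; the member template never has to know its own name.

resources:
  member_alarm:
    type: OS::Ceilometer::Alarm
    properties:
      meter_name: cpu_util
      statistic: avg
      period: 60
      evaluation_periods: 1
      threshold: 50
      comparison_operator: gt
      # Restrict the alarm to samples tagged with this member's name.
      matching_metadata:
        metadata.user_metadata.member: {get_param: member_name}
```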
We could make the parameter declaration implicit, but that (1) is a bit
irregular (reminiscent of "modes") if we only do it for stacks that are
scaling group members, and (2) is equivalent to the existing concept of
pseudo-parameters if we do it for all stacks. I would be content with
adding a pseudo-parameter, for all stacks, that is the UUID of the stack.
The index of the member in the group could be problematic, as indices are
re-used; the UUID is not re-used. Names also have uniqueness issues.

> > When Ceilometer does not exist, it is less obvious to me what could
> > usefully be done. Are there any useful SG member types besides
> > Compute instances and nested stacks? Note that a nested stack could
> > also pass its member deletion webhook to a load balancer (that is
> > willing to accept such a thing, of course), so we get a lot of unity
> > of mechanism between the case of detection by infrastructure vs.
> > application-level detection.
>
> I'm a little bit concerned about passing the member deletion webhook to
> the LB. Maybe we need to rethink this: do we really want to bring
> application-level design considerations down to the infrastructure
> level?

I look at it this way: do we want two completely independent loops of
detection and response, or shall we share a common response mechanism
between two different levels of detection? I think both want the same
response, and so I recommend a shared response mechanism.

> Some of the detection work might be covered by the observer engine spec
> that is under review. My doubt about it is how to make it "listen only
> to what it needs to know while ignoring everything else".

I am not sure what you mean by that. If this is about the case of the
group members being nested stacks, I go back to the idea that it must be
up to the nested template author to define failure (by declaring how to
detect it).

> > I am not entirely happy with the idea of a webhook per member.
> > If I understand correctly, generating webhooks is a somewhat
> > expensive and problematic process. What would be the alternative?
>
> My understanding is that the webhooks' problem is not about cost; it is
> more about authentication and flexibility. Steve Hardy and Thomas Herve
> are already looking into the authentication problem.

I was not disagreeing; I was including those in "problematic".

Thanks,
Mike
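PS: to make the pseudo-parameter suggestion above concrete -- if such a
pseudo-parameter existed (call it, hypothetically, "OS::stack_id", by
analogy with the existing "OS::stack_name"), a member template could tag
and query its telemetry by stack UUID with no new parameter convention
at all (alarm details below are likewise just a sketch):

```yaml
resources:
  member_alarm:
    type: OS::Ceilometer::Alarm
    properties:
      meter_name: cpu_util
      statistic: avg
      period: 60
      evaluation_periods: 1
      threshold: 50
      comparison_operator: gt
      # Keyed on the (proposed) stack-UUID pseudo-parameter; UUIDs,
      # unlike member indices, are never re-used.
      matching_metadata:
        metadata.user_metadata.stack: {get_param: "OS::stack_id"}
```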
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev