On 22/12/14 13:21, Steven Hardy wrote:
Hi all,

So, lately I've been having various discussions around $subject, and I know
it's something several folks in our community are interested in, so I
wanted to get some ideas I've been pondering out there for discussion.

I'll start with a proposal of how we might replace HARestarter with
AutoScaling group, then give some initial ideas of how we might evolve that
into something capable of a sort-of active/active failover.

1. HARestarter replacement.

My position on HARestarter has long been that equivalent functionality
should be available via AutoScalingGroups of size 1.  Turns out that
shouldn't be too hard to do:

  resources:
   server_group:
     type: OS::Heat::AutoScalingGroup
     properties:
       min_size: 1
       max_size: 1
       resource:
         type: ha_server.yaml

   server_replacement_policy:
     type: OS::Heat::ScalingPolicy
     properties:
       # FIXME: this adjustment_type doesn't exist yet
       adjustment_type: replace_oldest
       auto_scaling_group_id: {get_resource: server_group}
       scaling_adjustment: 1

One potential issue with this is that it is a little bit _too_ equivalent to HARestarter - it will replace your whole scaled unit (ha_server.yaml in this case) rather than just the failed resource inside.

So, currently our ScalingPolicy resource can only support three adjustment
types, all of which change the group capacity.  AutoScalingGroup already
supports batched replacements for rolling updates, so if we modify the
interface to allow a signal to trigger replacement of a group member, then
the snippet above should be logically equivalent to HARestarter AFAICT.

The steps to do this should be:

  - Standardize the ScalingPolicy-AutoScalingGroup interface, so
asynchronous adjustments (e.g. signals) between the two resources don't use
the "adjust" method.

  - Add an option to replace a member to the signal interface of
AutoScalingGroup

  - Add the new "replace" adjustment type to ScalingPolicy

I think I am broadly in favour of this.

I posted a patch which implements the first step, and the second will be
required for TripleO, i.e. we should be doing it soon.

https://review.openstack.org/#/c/143496/
https://review.openstack.org/#/c/140781/

2. A possible next step towards active/active HA failover

The next part is the ability to notify before replacement that a scaling
action is about to happen (just like we do for LoadBalancer resources
already) and orchestrate some or all of the following:

- Attempt to quiesce the currently active node (may be impossible if it's
   in a bad state)

- Detach resources (e.g. volumes, primarily?) from the current active node,
   and attach them to the new active node

- Run some config action to activate the new node (e.g. run some config
   script to fsck and mount a volume, then start some application).

The first step is possible by putting a SoftwareConfig/SoftwareDeployment
resource inside ha_server.yaml (using NO_SIGNAL so we don't fail if the
node is too bricked to respond and specifying DELETE action so it only runs
when we replace the resource).
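
A rough sketch of what that first step might look like inside
ha_server.yaml. The resource names, the image/flavor parameters and the
script body (including "myapp" and /mnt/shared) are illustrative
assumptions, not taken from any existing template; the two settings that
matter are signal_transport: NO_SIGNAL and actions: [DELETE].

  heat_template_version: 2014-10-16

  parameters:
    image: {type: string}
    flavor: {type: string}

  resources:
    server:
      type: OS::Nova::Server
      properties:
        image: {get_param: image}
        flavor: {get_param: flavor}
        # Required for the server to receive software deployments
        user_data_format: SOFTWARE_CONFIG

    quiesce_config:
      type: OS::Heat::SoftwareConfig
      properties:
        group: script
        config: |
          #!/bin/sh
          # Placeholder quiesce logic: stop the application and release
          # the shared volume. Errors are ignored on purpose.
          service myapp stop || true
          umount /mnt/shared || true

    quiesce_deployment:
      type: OS::Heat::SoftwareDeployment
      properties:
        config: {get_resource: quiesce_config}
        server: {get_resource: server}
        # Only run when the resource is being deleted (i.e. replaced)
        actions: [DELETE]
        # Don't wait for a reply, so a bricked node can't block the delete
        signal_transport: NO_SIGNAL

With NO_SIGNAL the deployment is a best-effort attempt: Heat gets no
confirmation that the script actually ran on the old node.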

The third step is possible either via a script inside the box which polls
for the volume attachment, or possibly via an update-only software config.

The second step is the missing piece AFAICS.

I've been wondering if we can do something inside a new heat resource,
which knows what the current "active" member of an ASG is, and gets
triggered on a "replace" signal to orchestrate e.g. deleting and creating a
VolumeAttachment resource to move a volume between servers.

Something like:

  resources:
   server_group:
     type: OS::Heat::AutoScalingGroup
     properties:
       min_size: 2
       max_size: 2
       resource:
         type: ha_server.yaml

   server_failover_policy:
     type: OS::Heat::FailoverPolicy
     properties:
       auto_scaling_group_id: {get_resource: server_group}
       resource:
         type: OS::Cinder::VolumeAttachment
         properties:
             # FIXME: "refs" is a ResourceGroup interface not currently
             # available in AutoScalingGroup
             instance_uuid: {get_attr: [server_group, refs, 1]}

   server_replacement_policy:
     type: OS::Heat::ScalingPolicy
     properties:
       # FIXME: this adjustment_type doesn't exist yet
       adjustment_type: replace_oldest
       auto_scaling_policy_id: {get_resource: server_failover_policy}
       scaling_adjustment: 1

This actually fails because a VolumeAttachment needs to be updated in place; if you try to switch servers but keep the same Volume when replacing the attachment you'll get an error.

TBH {get_attr: [server_group, refs, 1]} is doing most of the heavy lifting here, so in theory you could just have an OS::Cinder::VolumeAttachment instead of the FailoverPolicy and then all you need is a way of triggering a stack update with the same template & params. I know Ton added a PATCH method to update in Juno so that you don't have to pass parameters any more, and I believe it's planned to do the same with the template.
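
A minimal sketch of the first half of that, with a plain VolumeAttachment in place of the FailoverPolicy. The shared_volume_id parameter and the resource names are assumptions made up for illustration, and "refs" still has the same FIXME as in the snippet above, so this is not expected to work as-is.

  heat_template_version: 2014-10-16

  parameters:
    # Hypothetical parameter: ID of the pre-existing shared Cinder volume
    shared_volume_id: {type: string}

  resources:
    server_group:
      type: OS::Heat::AutoScalingGroup
      properties:
        min_size: 2
        max_size: 2
        resource:
          type: ha_server.yaml

    shared_volume_attachment:
      type: OS::Cinder::VolumeAttachment
      properties:
        volume_id: {get_param: shared_volume_id}
        # FIXME: "refs" is a ResourceGroup interface not currently
        # available in AutoScalingGroup
        instance_uuid: {get_attr: [server_group, refs, 1]}

A stack update with the same template and parameters would then re-resolve instance_uuid after a group member has been replaced, although it would still run into the attachment replacement error described above.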

By chaining policies like this we could trigger an update on the attachment
resource (or a nested template via a provider resource containing many
attachments or other resources) every time the ScalingPolicy is triggered.

For the sake of clarity, I've not included the existing stuff like
ceilometer alarm resources etc. above, but hopefully it gets the idea
across so we can discuss further. What are people's thoughts?  I'm quite
happy to iterate on the idea if folks have suggestions for a better
interface etc. :)
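
For reference, the omitted alarm wiring would presumably look something
like the fragment below, added to the resources section of the first
template above. The failure_meter parameter (which would need declaring in
the template's parameters section), the statistic and the threshold values
are assumptions chosen for illustration; what to measure in order to detect
a failed server is left open here.

  server_failure_alarm:
    type: OS::Ceilometer::Alarm
    properties:
      description: Trigger replacement when the failure meter fires
      # Hypothetical meter name supplied by the user
      meter_name: {get_param: failure_meter}
      statistic: max
      period: 60
      evaluation_periods: 1
      threshold: 0
      comparison_operator: gt
      # Signal the policy via its pre-signed alarm URL
      alarm_actions:
        - {get_attr: [server_replacement_policy, alarm_url]}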

One problem I see with the above approach is you'd have to trigger a
failover after stack create to get the initial volume attached, still
pondering ideas on how best to solve that..

To me this is falling into the same old trap of "hey, we want to run this custom workflow, all we need to do is add a new resource type to hang some code on". That's pretty much how we got HARestarter.

Also, like HARestarter, this cannot hope to cover the range of possible actions that might be needed by various applications.

IMHO the "right" way to implement this is that the Ceilometer alarm triggers a workflow in Mistral that takes the appropriate action defined by the user, which may (or may not) include updating the Heat stack to a new template where the shared storage gets attached to a different server.

cheers,
Zane.
