On 04/15/2014 03:16 AM, Qiming Teng wrote:
What I saw in this thread are several topics:

1) Is VM HA really relevant (in a cloud)?

This is the most difficult question to answer, because it really depends
on who you are talking to and which user community you are facing.
IMHO, for most web-based applications that are born to run on cloud,
a certain level of business resiliency has probably already been built
into the code, so the application or service can live happily when VMs
come and go.

For traditional business applications, the scenario may be quite
different.  These apps are migrated to cloud for reasons like cost
savings, server consolidation, etc.  Quite a few companies are
evaluating OpenStack for their "private cloud" -- which is a weird term,
IMHO.

In addition to this, while we are looking into the 'utility' vision of
cloud, we can still ask ourselves: a) can we survive one month of power
outage or water outage, even though there is abundant supply elsewhere
on this planet? b) what are the costs we need to pay if we eventually
make it? c) do we want to pay for this?

My personal experience is that our customers really want this feature
(VM HA) for their private clouds.  The question they asked us was:

"
   Does OpenStack support VM HA?  Maybe not for all VMs...
   We know we can have that using vSphere, Azure, or CloudStack...
"


2) Where is the best location to provide VM HA?

Suppose that we do feel the need to support VM HA, then the questions
following this would be 'where' and 'how'.

Considering that a VM is not merely a bundle of compute processes but
actually a virtual execution environment that consumes resources like
storage and network bandwidth besides processor cycles, Nova may NOT be
the ideal location to deal with this cross-cutting concern.

High availability involves redundant resource provisioning, effective
failure detection and appropriate fail-over policies, including fencing.
Imposing all these requirements on Nova is impractical.  We may need to
consider whether VM HA, if ever implemented/supported, should be part of
the orchestration service, aka Heat.


3) Can/should we do the VM HA orchestration in Heat?

My perception is that it can be done in Heat, based on my limited
understanding of how Heat works.  It may impose some requirements on other
projects (e.g. nova, cinder, neutron ...) as well, though Heat should be
the orchestrator.

What do we need then?

   - A resource type for VM groups/clusters, for the redundant
     provisioning.  VMs in the group can be identical instances, managed
     by a Pacemaker setup among the VMs, just like a WatchRule in Heat can
     be controlled by Ceilometer.

     Another way to do this is to have the VMs monitored via heartbeat
     messages sent by Nova (if possible/needed), or some services injected
     into the VMs (consider what cfn-hup, cfn-signal does today).

     Either way, the VM group/cluster can decide how to react to a VM
     online/offline signal.  It may choose to a) restart the VM in-place;
     b) remote-restart (aka evacuate) the VM somewhere else; c) live/cold
     migrate the VM to other nodes.

     The policies can be outsourced to other plugins that take global
     load-balancing or power management requirements into account.  But
     that is an advanced feature that warrants another blueprint.

   - Some fencing support from nova, cinder, neutron to shoot the bad VMs
     in the head so a VM that cannot be reached is guaranteed to be cleanly
     killed.

   - VM failure detectors that can reliably tell whether a VM has failed.
     Sometimes a VM that misses its expected performance goal should be
     treated as failed as well, if we really want to be strict on this.

     A failure detector can reside inside Nova, as has been done for
     the 'service groups' there.  It can also reside inside a VM, as a
     service installed there, sending out heartbeat messages (before the
     battery runs out, :))

   - A generic signaling mechanism that allows secure delivery of a
     message back to Heat indicating that a VM is alive or dead.
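
To make the list above a bit more concrete, here is a minimal sketch of a
heartbeat-based failure detector feeding a group-level reaction policy.
It is only an illustration under stated assumptions: the names
HeartbeatDetector, Action and choose_action are hypothetical, nothing here
is an existing Heat or Nova API, and the policy rule (evacuate when the
host is down, migrate when it is overloaded, otherwise restart in-place)
is just one possible choice.

```python
import enum
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class Action(enum.Enum):
    """The three reactions named in the list above (names are hypothetical)."""
    RESTART_IN_PLACE = "restart"
    EVACUATE = "evacuate"
    MIGRATE = "migrate"


@dataclass
class HeartbeatDetector:
    """Treats a VM as failed when no heartbeat arrived within `timeout` seconds."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def beat(self, vm_id: str, now: Optional[float] = None) -> None:
        # Record the time of the latest heartbeat from this VM.
        self.last_seen[vm_id] = time.monotonic() if now is None else now

    def failed(self, now: Optional[float] = None) -> List[str]:
        # Return the sorted ids of VMs whose last heartbeat is too old.
        now = time.monotonic() if now is None else now
        return sorted(vm for vm, seen in self.last_seen.items()
                      if now - seen > self.timeout)


def choose_action(host_alive: bool, host_overloaded: bool = False) -> Action:
    """Hypothetical group policy: pick a reaction for a failed VM."""
    if not host_alive:
        return Action.EVACUATE          # host is gone: restart elsewhere
    if host_overloaded:
        return Action.MIGRATE           # host is up but struggling
    return Action.RESTART_IN_PLACE      # host is fine: restart where it is
```

The `now` parameter exists only so the detector can be driven with
explicit timestamps in tests; a real detector would also need fencing
before any remote restart, as noted above.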

My current understanding is that we may avoid complicated task-flow
here.

Regards,
   - Qiming

Qiming,

If you read my original post on this thread, it outlines the current heat-core thinking, which is to move this resource out of the set of Heat resources, since it describes a workflow rather than an orchestrated thing (a Noun).

A good framework for HA already exists in the HARestarter resource. It incorporates HA escalation, which is a critical feature of any HA system. The fundamental problem with HARestarter is that it is in the wrong project.
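
For readers unfamiliar with the term, escalation roughly means taking a stronger recovery action when the same thing keeps failing. A minimal sketch of that general idea follows; it is not HARestarter's actual code, and the Escalator class, the ladder entries and the time window are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Optional

# Hypothetical escalation ladder, mildest action first.
DEFAULT_LADDER = ["restart_in_place", "evacuate", "give_up"]


class Escalator:
    """Escalates to a stronger action each time a VM fails again within `window` seconds."""

    def __init__(self, window: float, ladder: Optional[List[str]] = None):
        self.window = window
        self.ladder = ladder or DEFAULT_LADDER
        self.failures: Dict[str, Deque[float]] = {}

    def on_failure(self, vm_id: str, now: float) -> str:
        history = self.failures.setdefault(vm_id, deque())
        # Forget failures that are older than the window.
        while history and now - history[0] > self.window:
            history.popleft()
        history.append(now)
        # One recent failure maps to the first rung, two to the second, etc.,
        # capped at the last rung of the ladder.
        level = min(len(history), len(self.ladder)) - 1
        return self.ladder[level]
```

With a 60-second window, three failures of the same VM in quick succession would walk through all three rungs, while a failure long after the last one starts again at the first rung.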

Long term, HA, if desired, should be part of taskflow, though, because it's a verb, and verbs don't belong as heat orchestrated resources.

How we get from here to there is left as an exercise to the reader ;-)

Regards
-steve

For the most part we've been trying to encourage projects that want to
control VMs to add such functionality to the Orchestration program, aka
"Heat".
Yes, exactly.

-jay

Hey folks,

Just as a note for HA for VMs, our current heat-core thinking is that
our HARestarter resource functionality is a workflow (Restarter is a
verb, rather than a Noun - Heat orchestrates Nouns) and would be
better suited to a workflow service like Mistral.  Clearly we don't
know how to get from where we are today to the proper separation of
concerns as pointed out by Zane Bitter in recent threads on the ml,
but just throwing this out there so folks are aware.

Regards
-steve


_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

