Re: [openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Steven Dake Tue, 18 Mar 2014 09:58:17 -0700

On 03/18/2014 07:54 AM, Qiming Teng wrote:

Hi, Folks,


   I have been trying to implement a HACluster resource type in Heat. I
haven't created a BluePrint for this because I am not sure everything
will work as expected.

   The basic idea is to extend the OS::Heat::ResourceGroup resource type
with inner resource types fixed to be OS::Nova::Server.  Properties for
this HACluster resource may include:

   - init_size: initial number of Server instances;
   - min_size: minimum number of Server instances;
   - sig_handler: a reference to a sub-class of SignalResponder;
   - zones: a list of strings representing the availability zones, which
           could be the names of the racks where the Servers can be booted;
   - recovery_action: a list of supported failure recovery actions, such
       as 'restart', 'remote-restart', 'migrate';
   - fencing_options: a dict specifying what to do to shut down the Server
       in a clean way so that data consistency in storage and network is
       preserved;
   - resource_ref: a dict for defining the Server instances to be
       created.
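
   The property list above could be sketched as a plain Python data
structure, for illustration only. This is not Heat's properties-schema
API: the class name HAClusterSpec, the validate() method, and the rule
that init_size must not be smaller than min_size are all assumptions
made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Recovery actions named in the proposal; everything else here is hypothetical.
SUPPORTED_ACTIONS = {"restart", "remote-restart", "migrate"}


@dataclass
class HAClusterSpec:
    """Illustrative container for the proposed HACluster properties."""
    init_size: int
    min_size: int
    sig_handler: str = ""  # name of a SignalResponder sub-class (assumed to be a string)
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=list)
    fencing_options: Dict[str, str] = field(default_factory=dict)
    resource_ref: Dict[str, object] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the sizes or recovery actions are inconsistent."""
        if self.min_size < 0:
            raise ValueError("min_size must be non-negative")
        # Assumption: the cluster should never start below its minimum size.
        if self.init_size < self.min_size:
            raise ValueError("init_size must be >= min_size")
        unknown = set(self.recovery_action) - SUPPORTED_ACTIONS
        if unknown:
            raise ValueError("unsupported recovery actions: %s" % sorted(unknown))
```

   A real resource type would declare these through Heat's own schema
machinery; the sketch only shows which checks the properties imply.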

   Attributes of the HACluster may include:
   - refs: a list of resource IDs for the currently active Servers;
   - ips: a list of IP addresses for convenience.

   Note that the 'remote-restart' action above is today referred to as
'evacuate'.

   The most difficult issue here is to come up with a reliable VM failure
detection mechanism.  The service_group feature in Nova is only concerned
with the OpenStack services themselves, not the VMs.  Considering that
user-provided images can be used in our customer's cloud environment,
we cannot assume that agents inside the VMs will send heartbeat signals.

   I have checked the 'instances' table in the Nova database; it seems that
the 'updated_at' column is only updated when a VM state change is
reported.  If 'heartbeat' messages are coming in from many VMs very
frequently, there could be a DB query performance/scalability issue,
right?

   So, how can I detect VM failures reliably, so that I can notify Heat
to take the appropriate recovery action?

Qiming,

Check out

https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template

You should be able to use the HARestarter resource and functionality to do health checking of a VM.

It would be cool if nova could grow a feature to actively look at the VM's state internally and determine if it was healthy (e.g. look at its memory and see if the scheduler is running, things like that), but this would require individual support from each hypervisor for such functionality.

Until that happens, health checking from within the VM seems like the only reasonable solution.
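
As a rough illustration of the receiving side of in-VM health checking, here is a minimal sketch that records the last heartbeat per server in memory and reports the ones that have gone quiet. It is not how HARestarter is implemented; the class HeartbeatMonitor, its methods, and the timeout value are hypothetical, and how heartbeats reach the monitor is left out entirely.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical tracker: a server is stale once its last beat is older than timeout."""

    def __init__(self, timeout: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout = timeout
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_seen: Dict[str, float] = {}

    def beat(self, server_id: str) -> None:
        """Record a heartbeat from the given server at the current time."""
        self._last_seen[server_id] = self._clock()

    def stale(self) -> List[str]:
        """Return the sorted IDs of servers whose last heartbeat exceeded the timeout."""
        now = self._clock()
        return sorted(sid for sid, seen in self._last_seen.items()
                      if now - seen > self.timeout)
```

Keeping the timestamps in memory avoids a database write per heartbeat, at the cost of losing state when the monitor restarts. A server reported by stale() is only a candidate for recovery: a missed heartbeat may equally mean a network problem or a stopped agent.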


Regards
-steve

Regards,
   - Qiming

Research Scientist
IBM Research - China
tengqim at cn dot ibm dot com


_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


