Re: [openstack-dev] [Nova][Heat] How to reliably detect VM failures?

Zane Bitter Tue, 18 Mar 2014 12:35:33 -0700

On 18/03/14 12:42, Steven Dake wrote:

On 03/18/2014 07:54 AM, Qiming Teng wrote:

Hi, Folks,


   I have been trying to implement an HACluster resource type in Heat. I
haven't created a blueprint for this because I am not sure everything
will work as expected.

   The basic idea is to extend the OS::Heat::ResourceGroup resource type
with inner resource types fixed to be OS::Nova::Server.  Properties for
this HACluster resource may include:

   - init_size: initial number of Server instances;
   - min_size: minimal number of Server instances;
   - sig_handler: a reference to a sub-class of SignalResponder;
   - zones: a list of strings representing the availability zones, which
           could be the names of the racks where the Servers can be booted;
   - recovery_action: a list of supported failure recovery actions, such
       as 'restart', 'remote-restart', 'migrate';
   - fencing_options: a dict specifying what to do to shut down the Server
       in a clean way so that data consistency in storage and network is
       preserved;
   - resource_ref: a dict for defining the Server instances to be
       created.

   Attributes of the HACluster may include:
   - refs: a list of resource IDs for the currently active Servers;
   - ips: a list of IP addresses for convenience.

   Note that the 'remote-restart' action above is today referred to as
'evacuate'.
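
   The properties listed above could be written down roughly as follows.
This is only a sketch to make the proposal concrete: the class name, the
field types and the validate() helper are assumptions for illustration,
not Heat's property schema API. Only the property names and the three
recovery action names come from the list above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Action names taken from the proposal; everything else is hypothetical.
RECOVERY_ACTIONS = ("restart", "remote-restart", "migrate")


@dataclass
class HAClusterProperties:
    """Illustrative container for the proposed HACluster properties."""
    init_size: int                      # initial number of Server instances
    min_size: int                       # minimal number of Server instances
    sig_handler: str                    # reference to a SignalResponder sub-class
    zones: List[str] = field(default_factory=list)
    recovery_action: List[str] = field(default_factory=list)
    fencing_options: Dict[str, Any] = field(default_factory=dict)
    resource_ref: Dict[str, Any] = field(default_factory=dict)

    def validate(self) -> None:
        """Reject obviously inconsistent settings."""
        if self.min_size < 0:
            raise ValueError("min_size must be non-negative")
        if self.init_size < self.min_size:
            raise ValueError("init_size must be at least min_size")
        unknown = [a for a in self.recovery_action
                   if a not in RECOVERY_ACTIONS]
        if unknown:
            raise ValueError("unsupported recovery actions: %s" % unknown)
```

   Calling validate() on an instance with init_size=1 and min_size=2 would
raise ValueError, which is the kind of check a real resource type would
presumably do at template validation time.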

   The most difficult issue here is to come up with a reliable VM failure
detection mechanism.  The service_group feature in Nova is concerned only
with the OpenStack services themselves, not the VMs.  Because user-provided
images can be used in our customers' cloud environments, we cannot assume
that agents inside the VMs will send heartbeat signals.

   I have checked the 'instances' table in the Nova database; it seems that
the 'updated_at' column is only updated when a VM state change is
reported.  If 'heartbeat' messages were coming in from many VMs very
frequently, there could be a DB query performance/scalability issue,
right?
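
   To put a rough number on that concern: if every VM sends one heartbeat
per interval and each heartbeat causes one DB write, the write rate is
simply the VM count divided by the interval. The function name and the
figures in the usage note are illustrative assumptions, not measurements.

```python
def heartbeat_writes_per_second(num_vms: int, interval_seconds: float) -> float:
    """Steady-state DB writes per second if each heartbeat is one write."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return num_vms / interval_seconds
```

   For example, 10000 VMs reporting every 10 seconds would mean 1000
writes per second on top of the normal state-change traffic.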

   So, how can I detect VM failures reliably, so that I can notify Heat
to take the appropriate recovery action?

Qiming,

Check out

https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template


You should be able to use the HARestarter resource and functionality to
do health checking of a VM.

HARestarter is actually pretty problematic, both in a "causes major
architectural headaches for Heat and will probably be deprecated very
soon" sense and a "may do very unexpected things to your resources"
sense. I wouldn't recommend it.


cheers,
Zane.

It would be cool if Nova could grow a feature to actively look at the
VM's state internally and determine if it was healthy (e.g. look at its
memory and see if the scheduler is running, things like that), but this
would require individual support from each hypervisor for such
functionality.

Until that happens, healthchecking from within the vm seems like the
only reasonable solution.
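
As a rough sketch of what consuming such in-VM health checks could look
like on the receiving side: keep the last heartbeat time per VM in memory
and flag any VM that has been silent for longer than a timeout. The class
and method names here are hypothetical and not part of Heat or Nova, and
silence alone cannot distinguish a dead VM from a dead agent or a network
problem.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks last-seen times and reports VMs that have gone silent."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout                  # seconds of allowed silence
        self._last_seen: Dict[str, float] = {}  # vm_id -> last heartbeat time

    def record(self, vm_id: str, now: float) -> None:
        """Note that a heartbeat from vm_id arrived at time `now`."""
        self._last_seen[vm_id] = now

    def suspected_failures(self, now: float) -> List[str]:
        """Return the sorted IDs of VMs silent for more than `timeout`."""
        return sorted(vm for vm, seen in self._last_seen.items()
                      if now - seen > self.timeout)
```

Keeping the timestamps in memory rather than in the Nova database avoids
a DB write per heartbeat, at the cost of losing state if the monitor
itself restarts.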

Regards
-steve

Regards,
   - Qiming

Research Scientist
IBM Research - China
tengqim at cn dot ibm dot com


_______________________________________________
OpenStack-dev mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


