[openstack-dev] [Nova][Heat] How to reliably detect VM failures? (Zane Bitter)

WICKES, ROGER Wed, 19 Mar 2014 06:15:51 -0700

> On 03/18/2014 07:54 AM, Qiming Teng wrote:
>> Hi, Folks,
>>
>>    I have been trying to implement a HACluster resource type in Heat. I
>> haven't created a BluePrint for this because I am not sure everything
>> will work as expected.
...
>>    The most difficult issue here is to come up with a reliable VM failure
>> detection mechanism.  The service_group feature in Nova only concerns
>> about the OpenStack services themselves, not the VMs.  Considering that
>> in our customer's cloud environment, user provided images can be used,
>> we cannot assume some agents in the VMs to send heartbeat signals.


[Roger] My response is more user-oriented than developer-oriented, but
since this was asked on the dev list, here goes:

When enabled, the hypervisor continuously collects basic CPU and memory
stats (and sends them to Ceilometer), and you can set alarms on them.
http://docs.openstack.org/trunk/openstack-ops/content/logging_monitoring.html

For external monitoring, consider setting up a Nagios or Selenium server
for agentless monitoring. You can have it do the most basic heartbeat
(ping) test; if the ping fails, or is slow for a period of, say, five
minutes, raise an alarm that you have a network problem. You can use
Selenium to execute synthetic transactions against whatever the server is
supposed to provide; if it works for you, you can assume it is working
for everyone else. If it fails, you can take action.
http://www.seleniumhq.org
You can also use Selenium to re-run selected OpenStack test cases to
ensure your infrastructure is working properly.
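
As a rough illustration of the heartbeat (ping) check described above, here
is a minimal sketch of a Nagios-style plugin in Python. The exit codes
(0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin
convention; the 200 ms "slow" threshold, the function names, and the
Linux-style `ping -c` invocation are assumptions made for the example.

```python
import re
import subprocess
import sys

# Usual Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical threshold: an average round trip above this counts as "slow".
SLOW_MS = 200.0


def classify(rtts_ms, sent, slow_ms=SLOW_MS):
    """Map ping results to a Nagios status code.

    rtts_ms: round-trip times (ms) of the replies that came back.
    sent:    number of echo requests that were sent.
    """
    if sent <= 0:
        return UNKNOWN
    if not rtts_ms:
        return CRITICAL  # no replies at all
    if len(rtts_ms) < sent:
        return WARNING   # some packets were lost
    avg = sum(rtts_ms) / len(rtts_ms)
    return WARNING if avg > slow_ms else OK


def ping(host, count=3):
    """Run the system ping (Linux-style flags assumed) and return reply times in ms."""
    proc = subprocess.run(
        ["ping", "-c", str(count), host],
        capture_output=True,
        text=True,
    )
    return [float(t) for t in re.findall(r"time[=<]([\d.]+)", proc.stdout)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_heartbeat.py 10.0.0.5
    sys.exit(classify(ping(sys.argv[1]), 3))
```

The "slow for five minutes" part is probably best left to Nagios itself,
which can re-run a check several times (max_check_attempts, retry_interval)
before it treats a problem as confirmed.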

>>    I have checked the 'instance' table in Nova database, it seemed that
>> the 'update_at' column is only updated when VM state changed and
>> reported.  If the 'heartbeat' messages are coming in from many VMs very
>> frequently, there could be a DB query performance/scalability issue,
>> right?

[Roger] For time-series, high-volume collection, consider a non-relational
system such as RRDTool, PyRRD, Graphite, etc. if you want to store the
history and look for trends.
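
To make this concrete, here is a small sketch of pushing heartbeat samples
to Graphite instead of the Nova database. It uses Carbon's plaintext
protocol (one "path value timestamp" line per sample, port 2003 by
default). The metric path "vms.web01.heartbeat" and the helper names are
made up for the example.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample in Carbon's plaintext format: 'path value timestamp'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send(lines, host="localhost", port=2003):
    """Send already-formatted lines to a Carbon plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


# Example (needs a running Carbon daemon, so it is left commented out):
# send([graphite_line("vms.web01.heartbeat", 1)])
```

Each VM heartbeat then becomes an append to a time series rather than a
row update, which is the kind of write pattern these tools are built for.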

>>    So, how can I detect VM failures reliably, so that I can notify Heat
>> to take the appropriate recovery action?

[Roger] When Nagios detects a problem, have it kick off the appropriate
shell script that invokes the Heat API (or another API) to fix the issue
with the cluster. I think you were hoping that Heat could be coded to
automagically fix any issue, but you may need to be more specific: develop
specific use cases for what you mean by "VM failure", as the desired
action may differ depending on the type of failure.
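
A sketch of what such a handler might look like, written in Python rather
than shell. It assumes Nagios is configured to pass the host state
($HOSTSTATE$), the state type ($HOSTSTATETYPE$) and the host name as
arguments. The action names and the recover() placeholder are hypothetical;
the real call to Heat (or anything else) would go there.

```python
import sys

# Hypothetical mapping from the kind of failure to the recovery action.
# Different failure types get different actions.
ACTIONS = {
    "DOWN": "rebuild_vm",
    "UNREACHABLE": "check_network",
}


def choose_action(state, state_type):
    """Pick a recovery action, acting only on confirmed (HARD) problems."""
    if state_type != "HARD":
        return None  # SOFT states are still being re-checked by Nagios
    return ACTIONS.get(state)


def recover(action, host):
    """Placeholder: the actual Heat API call (or other fix) is not shown."""
    print(f"would run {action} for {host}")


if __name__ == "__main__" and len(sys.argv) == 4:
    # Example: handler.py DOWN HARD web01
    state, state_type, host = sys.argv[1:4]
    action = choose_action(state, state_type)
    if action:
        recover(action, host)
```

Keeping the failure-to-action table separate from the handler logic is one
way to force the use cases to be written down explicitly.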

> Qiming,
>
> Check out
>
> https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template

