> On 03/18/2014 07:54 AM, Qiming Teng wrote:
>> Hi, Folks,
>>
>> I have been trying to implement a HACluster resource type in Heat. I
>> haven't created a blueprint for this because I am not sure everything
>> will work as expected.
...
>> The most difficult issue here is to come up with a reliable VM failure
>> detection mechanism. The service_group feature in Nova only concerns
>> the OpenStack services themselves, not the VMs. Considering that
>> in our customer's cloud environment, user-provided images can be used,
>> we cannot assume that agents in the VMs will send heartbeat signals.

[Roger] My response is more user-oriented than developer-oriented, but
this was asked on dev, so here goes:

When enabled, the hypervisor is always collecting (and sending to
Ceilometer) basic CPU and memory stats that you can alarm on.

http://docs.openstack.org/trunk/openstack-ops/content/logging_monitoring.html

For external monitoring, consider setting up a Nagios or Selenium server
for agent-less monitoring. You can have it do the most basic heartbeat
(ping) test; if the ping is slow for a period of, say, five minutes, or
fails, alarm that you have a network problem.

You can use Selenium to execute synthetic transactions against whatever
the server is supposed to provide; if it works for you, you can assume
it is working for everyone else. If it fails, you can take action.

http://www.seleniumhq.org

You can also use Selenium to re-run selected OpenStack test cases to
ensure your infrastructure is working properly.

>> I have checked the 'instances' table in the Nova database; it seems that
>> the 'updated_at' column is only updated when a VM state change is
>> reported. If the 'heartbeat' messages are coming in from many VMs very
>> frequently, there could be a DB query performance/scalability issue,
>> right?

[Roger] For time-series, high-volume collection, consider going to a
non-relational system like RRDTool, PyRRD, Graphite, etc. if you want to
store the history and look for trends.

>> So, how can I detect VM failures reliably, so that I can notify Heat
>> to take the appropriate recovery action?

[Roger] When Nagios detects a problem, have it kick off the appropriate
script (e.g. a shell script) that invokes the Heat API, or some other
API, to fix the issue with the cluster.

I think you were hoping that Heat could be coded to automagically fix
any issue, but I think you may need to be more specific: develop
specific use cases for what you mean by "VM failure", as the desired
action may be different depending on the type of failure.
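
To make the ping-based check above a bit more concrete, here is a rough
Python sketch of an agent-less heartbeat monitor. It is only an
illustration under stated assumptions, not how Nagios or Heat implement
anything: the names ping_once, HeartbeatMonitor, watch and on_alarm are
hypothetical, the thresholds (0.5 s latency, a five-minute slow window,
three consecutive failures) are made-up defaults, and it assumes a
Linux-style ping binary. It reports "slow" and "down" as separate states,
since the recovery action may differ by failure type.

```python
import subprocess
import time
from typing import Callable, Optional


def ping_once(host: str, timeout_s: int = 2) -> Optional[float]:
    """Return the wall-clock time of one ping in seconds, or None on failure.

    Assumption: a Linux-style `ping` (-c count, -W timeout in seconds).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return time.monotonic() - start


class HeartbeatMonitor:
    """Classify probe results as "ok", "suspect", "slow" or "down"."""

    def __init__(self, slow_threshold_s: float = 0.5,
                 slow_window_s: float = 300.0, max_failures: int = 3):
        self.slow_threshold_s = slow_threshold_s
        self.slow_window_s = slow_window_s
        self.max_failures = max_failures
        self.failures = 0          # consecutive failed probes
        self.slow_since = None     # time of first probe in current slow run

    def record(self, latency_s: Optional[float], now: float) -> str:
        """Feed one probe result (None means the probe failed)."""
        if latency_s is None:
            self.failures += 1
            if self.failures >= self.max_failures:
                return "down"
            return "suspect"
        self.failures = 0
        if latency_s > self.slow_threshold_s:
            if self.slow_since is None:
                self.slow_since = now
            if now - self.slow_since >= self.slow_window_s:
                return "slow"
        else:
            self.slow_since = None
        return "ok"


def watch(host: str, on_alarm: Callable[[str, str], None],
          interval_s: float = 30.0) -> None:
    """Probe `host` forever; call on_alarm(host, state) on "slow" or "down".

    on_alarm is a placeholder for whatever recovery hook you use (for
    example a script that talks to Heat). It is called on every probe
    while the condition persists, so the hook should be idempotent.
    """
    monitor = HeartbeatMonitor()
    while True:
        state = monitor.record(ping_once(host), time.monotonic())
        if state in ("slow", "down"):
            on_alarm(host, state)
        time.sleep(interval_s)
```

Something like watch("10.0.0.5", my_recovery_hook) would then run the
loop, where both the address and my_recovery_hook are placeholders. The
classification is kept separate from the probing so that the thresholds
can be tuned, or the ping swapped for a synthetic transaction, without
touching the rest.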

> Qiming,
>
> Check out
>
> https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template

_______________________________________________
OpenStack-dev mailing list
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
