On Tue, Mar 18, 2014 at 09:42:18AM -0700, Steven Dake wrote:
> On 03/18/2014 07:54 AM, Qiming Teng wrote:
> >Hi, Folks,
> >
> >  I have been trying to implement a HACluster resource type in Heat. I
> >haven't created a BluePrint for this because I am not sure everything
> >will work as expected.
> >
> >  The basic idea is to extend the OS::Heat::ResourceGroup resource type
> >with inner resource types fixed to be OS::Nova::Server. Properties for
> >this HACluster resource may include:
> >
> >  - init_size: initial number of Server instances;
> >  - min_size: minimal number of Server instances;
> >  - sig_handler: a reference to a sub-class of SignalResponder;
> >  - zones: a list of strings representing the availability zones, which
> >    could be the names of the racks where the Servers can be booted;
> >  - recovery_action: a list of supported failure recovery actions, such
> >    as 'restart', 'remote-restart', 'migrate';
> >  - fencing_options: a dict specifying what to do to shut down the Server
> >    in a clean way so that data consistency in storage and network is
> >    preserved;
> >  - resource_ref: a dict for defining the Server instances to be
> >    created.
> >
> >  Attributes of the HACluster may include:
> >  - refs: a list of resource IDs for the currently active Servers;
> >  - ips: a list of IP addresses for convenience.
> >
> >  Note that the 'remote-restart' action above is today referred to as
> >'evacuate'.
> >
> >  The most difficult issue here is to come up with a reliable VM failure
> >detection mechanism. The service_group feature in Nova is only concerned
> >with the OpenStack services themselves, not the VMs. Considering that
> >in our customer's cloud environment, user provided images can be used,
> >we cannot assume some agents in the VMs to send heartbeat signals.
> >
> >  I have checked the 'instances' table in the Nova database; it seems that
> >the 'updated_at' column is only updated when a VM state change is
> >reported. If the 'heartbeat' messages are coming in from many VMs very
> >frequently, there could be a DB query performance/scalability issue,
> >right?
> >
> >  So, how can I detect VM failures reliably, so that I can notify Heat
> >to take the appropriate recovery action?
> Qiming,
>
> Check out
>
> https://github.com/openstack/heat-templates/blob/master/cfn/F17/WordPress_Single_Instance_With_HA.template
>
> You should be able to use the HARestarter resource and functionality
> to do healthchecking of a vm.
>
> It would be cool if nova could grow a feature to actively look at
> the vm's state internally and determine if it was healthy (eg look
> at its memory and see if the scheduler is running, things like that)
> but this would require individual support from each hypervisor for
> such functionality.
>
> Until that happens, healthchecking from within the vm seems like the
> only reasonable solution.
>
> Regards
> -steve
>
Yes, Steve, HARestarter is an option. I have been playing with the
template you mentioned for quite a few days to make it work. Since I was
using RAW, not CFN_TOOLS, as the user_data_format for Servers, I passed
the CFN credentials and BOTO configs, among other files, using
CloudConfig. To make heat-cfntools happy, I had to:

  - write the BOTO configs into /var/lib/heat-cfntools/cfn-boto-cfg,
    because cfn-init hardcodes the BOTO_CONFIG environment variable;
  - provide an AWS::CloudFormation::Init metadata section to make
    cfn-init happy, even though I was not using EC2::Instance for the
    VM server;
  - provide faked AWS::StackName and AWS::Region values, since these are
    not working properly now.

The VM instance can now contact the CFN and CloudWatch endpoints, and it
correctly sends WaitCondition signals and other messages. However, I see
this as a solution tightly bound to heat-cfntools: it is not generic
enough, and it may be deprecated some day soon.

So, back to my original question:

What else can we do to reliably detect VM failures?
---------------------------------------------------

We have noticed VM HA support in Windows Azure [1], CloudStack [2],
VMware vSphere [3], and even Linux-HA [4], for example. It would be
highly desirable to have some support in OpenStack. Our customers keep
asking for this feature, anyway.

Regards,
  Qiming

[1] http://www.windowsazure.com/en-us/documentation/articles/manage-availability-virtual-machines/
[2] http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.0.2/html/Admin_Guide/ha-enabled-vm.html
[3] http://www.vmware.com/products/vsphere/features-high-availability
[4] http://linux-ha.org/doc/man-pages/re-ra-VirtualDomain.html
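To make the heartbeat scalability concern from the quoted message more concrete, here is a minimal sketch of a timeout-based failure detector that keeps the last-seen timestamps in memory instead of writing each heartbeat to a database. It is purely illustrative and not part of Heat or Nova: the class name HeartbeatMonitor and its methods are hypothetical, and whatever produces the liveness reports (an in-guest agent, or an external probe on the compute host) is assumed to exist outside this sketch.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical in-memory, timeout-based VM failure detector."""

    timeout: float  # seconds of silence before a VM is suspected
    clock: Callable[[], float] = time.monotonic  # injectable for tests
    last_seen: Dict[str, float] = field(default_factory=dict)

    def beat(self, vm_id: str) -> None:
        """Record a liveness report for vm_id (no database write)."""
        self.last_seen[vm_id] = self.clock()

    def forget(self, vm_id: str) -> None:
        """Stop tracking a VM, e.g. after it was deleted on purpose."""
        self.last_seen.pop(vm_id, None)

    def suspected(self) -> List[str]:
        """Return the sorted IDs of VMs silent for longer than timeout."""
        now = self.clock()
        return sorted(
            vm_id
            for vm_id, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A periodic task could call suspected() and hand the resulting IDs to whatever triggers the recovery action, so that only confirmed state changes, not every heartbeat, would need to reach persistent storage. A detector like this can only suspect a failure; a slow network or a busy guest looks the same as a dead VM, which is one reason fencing matters before any restart.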