On 2/16/17, 5:18 AM, "Rohit Yadav" <[email protected]> wrote:
All,
I would like to start a discussion on a new feature: Host HA for CloudStack.
CloudStack lacks a way to reliably fence a host. The idea of the host-ha
feature is to provide a general-purpose HA framework, plus hypervisor-specific
HA provider implementations that can use additional mechanisms such as OOBM
(IPMI-based power management) to reliably investigate, recover and fence a
host. This feature can handle scenarios involving server crashes, reliable
fencing of hosts and HA of VMs. The first version will have an HA provider
implementation for KVM, and one for the simulator so that we can test the
framework implementation and write Marvin tests that validate the feature on
Travis and others.
Please have a look at the FS here:
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
Looking forward to your comments and questions.
Regards.
[email protected]
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK
@shapeblue
Rohit,
First, thanks for all the work you have put into this. This is something that
CS has sorely needed for a long time.
A couple of items:
1.) You state the following:
“Before invoking the HA provider’s fence operation, the HA resource management
will place the resource in maintenance mode. The intention is to require an
administrator to manually verify that a resource is ready to return service by
requiring an administrator to take it out of maintenance mode.”
I agree that putting a host in maintenance mode to require manual intervention
in order to bring it back online is ideal and honestly how I would probably
prefer to do it. However, I also like to give the end user/operator choice.
Perhaps we could add an option to bring the Host out of Maintenance mode
automatically if it passes all checks and comes back into an ELIGIBLE state.
This way, if the operator chooses, the host could come back into full operation
and start recovering VMs if needed. This could also be handy if your
environment isn’t quite n+1 when it comes to host capacity and you need to have
the host back up and running as soon as possible to minimize the outage
duration. Again, I know it isn’t ideal, but I don’t see the harm in giving the
operator the choice.
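To make the suggestion concrete, here is a minimal sketch of the decision I have in mind. It is in Python for brevity and is not CloudStack code: the `auto_exit_maintenance` setting, the `Host` class and the `after_recovery` function are all hypothetical names, and the FS as written always leaves the host in maintenance mode.

```python
from dataclasses import dataclass
from enum import Enum


class HostState(Enum):
    MAINTENANCE = "Maintenance"
    ENABLED = "Enabled"


@dataclass
class Host:
    name: str
    # After a fence operation the host is assumed to sit in maintenance mode.
    state: HostState = HostState.MAINTENANCE


def after_recovery(host: Host, checks_passed: bool,
                   auto_exit_maintenance: bool) -> HostState:
    """Decide the host's state once the HA checks have run after a fence.

    auto_exit_maintenance is a hypothetical operator-controlled setting.
    When it is False, the host stays in maintenance mode until an
    administrator takes it out manually, as the FS describes.
    """
    if (host.state is HostState.MAINTENANCE
            and checks_passed
            and auto_exit_maintenance):
        host.state = HostState.ENABLED
    return host.state
```

With the setting off, the behaviour matches the FS exactly; with it on, the host only leaves maintenance mode when every check has passed.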
2.) You state the following:
“For the initial release, only KVM with NFS storage will be supported. However,
the storage check component will be implemented in a modular fashion allowing
for checks using other storage platforms (e.g. Ceph) in the future. HA provider
plugins can be implemented for other hypervisors.”
We are using KVM with a Ceph backend and would be very interested in helping
make it a part of the initial push for this feature. I have a Dev environment
backed by Ceph that we could use for testing and would be willing to help with
the development of the Ceph activity checks.
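As a rough illustration of how a modular storage check could admit a Ceph implementation next to the NFS one, here is a small sketch. It is in Python for brevity and every name in it (`StorageActivityCheck`, `register`, `check_for`, the `"nfs"` and `"ceph"` keys) is an assumption of mine, not something taken from the FS or from CloudStack. The checks are stubbed with in-memory data; a real check would have to query the storage backend.

```python
from abc import ABC, abstractmethod
from typing import Dict, Set, Type


class StorageActivityCheck(ABC):
    """Hypothetical plugin interface: is a host still active on storage?"""

    def __init__(self, active_hosts: Set[str]):
        # Stub data standing in for whatever the backend would report.
        self.active_hosts = active_hosts

    @abstractmethod
    def has_activity(self, host: str) -> bool:
        ...


# Registry mapping a storage type name to its check implementation.
_CHECKS: Dict[str, Type[StorageActivityCheck]] = {}


def register(storage_type: str):
    """Class decorator that adds a check to the registry."""
    def wrap(cls: Type[StorageActivityCheck]) -> Type[StorageActivityCheck]:
        _CHECKS[storage_type] = cls
        return cls
    return wrap


@register("nfs")
class NfsActivityCheck(StorageActivityCheck):
    def has_activity(self, host: str) -> bool:
        return host in self.active_hosts


@register("ceph")
class CephActivityCheck(StorageActivityCheck):
    def has_activity(self, host: str) -> bool:
        return host in self.active_hosts


def check_for(storage_type: str, active_hosts: Set[str]) -> StorageActivityCheck:
    """Look up the check for a storage type; fail loudly if unsupported."""
    try:
        return _CHECKS[storage_type](active_hosts)
    except KeyError:
        raise ValueError(f"no activity check for storage type {storage_type!r}")
```

The point of the registry is that adding Ceph support would mean adding one class, with no change to the code that calls `check_for`.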
I’m looking forward to getting this feature added to CS. Again, great job
putting this together and starting the conversation.
Thanks,
Mabry