Hi Koushik,

Thanks for sharing your comments and questions.


1. Yes, the FS is divided into two parts: a general HA framework that makes 
no assumptions about the type of resource, and an HA provider implementation 
that works on a specific type of resource (hypervisor, storage, etc.). 
Specifically, with this feature we want to solve the problem of HA-ing a host 
reliably, using the out-of-band management subsystem in the HA provider 
implementation (i.e. IPMI-based status/reboot/power-off to 
investigate/recover/fence the host). Yes, host HA should trigger VM HA, i.e. 
when a host is fenced, its HA VMs are moved to other hosts. This also reliably 
prevents the disk corruption that can occur when the same HA VM gets started 
on multiple hosts.


2. The old VM HA implementation makes a lot of assumptions about the type of 
resource (i.e. VM) it is HA-ing. It is tied to VM HA, which is why HA for hosts 
could not be added in a straightforward way without regressions we could not 
test. The new HA framework makes no assumptions about the type of resource and 
separates policy from mechanism. We also want to add deterministic tests (using 
marvin tests and a simulator-based HA provider implementation) to demonstrate 
the generic HA functionality. In future, HA for various resources such as VM, 
storage and network can be added with this framework. As a first step we want 
to get the framework in, along with support for Host as a resource type. We 
also want to reduce assumptions and dependencies, as VM HA and Host HA are 
related (sequence etc.). The HAProvider interface would be something every 
hypervisor can implement.
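
To illustrate the policy/mechanism split, here is a minimal Python sketch. It 
is not the interface from the FS: the method names (is_healthy, recover, 
fence), the SimulatorHAProvider class and the handle() policy function are all 
assumptions made up for illustration. The point is only that the policy code 
never looks at the resource type, and that a scripted simulator provider makes 
the outcome deterministic and testable.

```python
from abc import ABC, abstractmethod


class HAProvider(ABC):
    """Hypothetical mechanism interface: one implementation per resource type."""

    @abstractmethod
    def is_healthy(self, resource: str) -> bool: ...

    @abstractmethod
    def recover(self, resource: str) -> bool: ...

    @abstractmethod
    def fence(self, resource: str) -> bool: ...


class SimulatorHAProvider(HAProvider):
    """Scripted provider for tests: no real host is touched."""

    def __init__(self, healthy: bool, recover_ok: bool):
        self.healthy = healthy
        self.recover_ok = recover_ok
        self.calls = []  # records the order of operations for assertions

    def is_healthy(self, resource: str) -> bool:
        self.calls.append("is_healthy")
        return self.healthy

    def recover(self, resource: str) -> bool:
        self.calls.append("recover")
        self.healthy = self.recover_ok
        return self.recover_ok

    def fence(self, resource: str) -> bool:
        self.calls.append("fence")
        return True


def handle(provider: HAProvider, resource: str) -> str:
    """Policy: check health, try recovery, fence only as a last resort."""
    if provider.is_healthy(resource):
        return "available"
    if provider.recover(resource):
        return "recovered"
    return "fenced" if provider.fence(resource) else "fence-failed"
```

With SimulatorHAProvider(healthy=False, recover_ok=False), handle() calls 
is_healthy, recover and fence in that order and returns "fenced". A real 
provider (e.g. for KVM) would replace only the mechanism class.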


3. While an existing (VM) HA framework exists, it was safer to write new code 
and demonstrate that it works for any general HA resource than to refactor and 
implement this in the old framework, which could introduce serious regressions 
leading to production issues. For the most part, we've avoided altering 
anything in the old HA framework, while making sure that the old (VM) HA works 
well with the new HA framework. The JIRA issue for the feature is in the FS.


4. Any HA operation can be blocking in nature. One of the things included is a 
background polling manager that polls for changes, along with a task/activity 
executor, as out-of-band operations can take time. Therefore, all the 
health/activity/fencing/recovery operations have timeouts, limits and 
specific queues. The existing framework does not provide any abstraction to 
queue operations, restrict their timeouts, or tie them to an FSM. The existing 
framework is also hard to test, specifically to validate using integration 
tests. We also wanted to avoid adding any regressions to the existing/old VM 
HA. Lastly, the primary use of IPMI/out-of-band management in performing 
host HA is not investigation but recovery (try a reboot) and fencing (power 
off).
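
As a rough illustration of timeouts, retry limits and an FSM working together, 
here is a small Python sketch. It is not the design from the FS: the state 
names, the poll_once() function, the default timeout and the recovery limit 
are all assumptions chosen for the example. Each operation runs on an executor 
and is treated as failed if it does not return within its timeout.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from enum import Enum


class HostState(Enum):
    AVAILABLE = "available"
    SUSPECT = "suspect"        # unhealthy, not yet recovered or fenced
    RECOVERING = "recovering"
    FENCED = "fenced"


def run_with_timeout(pool, operation, timeout_s):
    """Run a possibly blocking operation; a timeout or error counts as failure."""
    future = pool.submit(operation)
    try:
        return bool(future.result(timeout=timeout_s))
    except FutureTimeout:
        return False
    except Exception:
        return False


def poll_once(pool, state, health_check, recover, fence,
              timeout_s=0.5, max_recoveries=2):
    """One polling pass for a single host; returns the next state."""
    if state is HostState.FENCED:
        return state  # terminal in this sketch
    if run_with_timeout(pool, health_check, timeout_s):
        return HostState.AVAILABLE
    for _ in range(max_recoveries):  # bounded number of recovery attempts
        if run_with_timeout(pool, recover, timeout_s):
            return HostState.RECOVERING
    if run_with_timeout(pool, fence, timeout_s):
        return HostState.FENCED
    return HostState.SUSPECT  # fencing failed; retried on the next pass


def hung_check():
    """Simulates an out-of-band call that outlasts its timeout."""
    time.sleep(0.3)
    return True
```

For example, with a ThreadPoolExecutor(max_workers=2) as the pool, passing 
hung_check as the health check with timeout_s=0.05, a recover callable that 
returns False and a fence callable that returns True moves the host to FENCED. 
Note that a timed-out call still occupies its worker thread until it returns, 
which is one reason to keep separate, bounded queues per operation type.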


Hope this answers your questions. Please feel free to add more comments and 
questions. Thanks.


Regards.


________________________________
From: Koushik Das <koushik....@accelerite.com>
Sent: 20 February 2017 11:45
To: dev@cloudstack.apache.org
Subject: Re: [DISCUSS][FS] Host HA for CloudStack

Rohit,

Thanks for the effort you have put in writing the FS. I have some questions 
based on my initial reading of the FS.

1. “Host HA” – In the FS you are talking about a generic HA framework, but it 
is not clear what “host HA” means. Is it that all or some of the VMs running 
on a host will be started on other host(s) in case of a failure, or is it 
something else? How is it different from the existing “VM HA”?
2. You have mentioned that “Cloudstack lacks a way to reliably fence host”. 
CloudStack considers a VM a first-class object and so provides fencing for VMs 
instead of hosts. There are hypervisor-specific plugins that implement 
mechanisms to fence a VM. I am not sure it makes sense to expose host fencing, 
as the end user doesn’t care about it. The VM fencing implementation could, 
however, use something like “host fencing” internally.
3. There is an existing HA framework which provides plugins for investigating 
whether a VM is alive or not, whether a host is alive or not, and for fencing 
a VM in case it is not alive. It would be good to understand the limitations 
of the existing framework and how the new framework helps in solving these 
problems. We also need to understand whether the limitation is in the 
framework or in some specific plugin implementation that is causing issues. 
References to JIRA issues would help.
4. You have mentioned using IPMI to investigate host failure. I would like to 
understand why the same can’t be used in the existing framework.

Thanks,
Koushik

On 16/02/17, 4:48 PM, "Rohit Yadav" <rohit.ya...@shapeblue.com> wrote:

    All,


    I would like to start a discussion on a new feature: Host HA for CloudStack.

    CloudStack lacks a way to reliably fence a host. The idea of the host-ha
    feature is to provide a general-purpose HA framework and a
    hypervisor-specific HA provider implementation that can use additional
    mechanisms such as OOBM (IPMI-based power management) to reliably
    investigate, recover and fence a host. This feature can handle scenarios
    associated with server crashes, reliable fencing of hosts, and HA of VMs.
    The first version will have an HA provider implementation for KVM (and one
    for the simulator, to test the framework implementation and to write marvin
    tests that can validate the feature on Travis and others).


    Please have a look at the FS here:

    https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA


    Looking forward to your comments and questions.


    Regards.

    rohit.ya...@shapeblue.com
    www.shapeblue.com
    53 Chandos Place, Covent Garden, London  WC2N 4HS UK
    @shapeblue







