Re: [Proposal] - StorageHA

Tutkowski, Mike Tue, 14 Mar 2017 06:26:07 -0700

Thanks for your clarification. I see now. You were referring to a networking 
problem where one host could not see the storage (but the storage was still up 
and running).


On 3/13/17, 10:31 PM, "Jeromy Grimmett" <[email protected]> wrote:

    I apologize for the delayed response.  Let me clarify the points 
requested:
    
    Mike asked:
    
    "What I was curious about is if you plan to exclusively build your feature 
as a set of scripts and/or if you plan to update the CloudStack code base, as 
well."
    
    JG:  My idea was to do this separately as a plugin, then add it to the code 
base down the road.
    
    "Also, if a primary storage actually goes offline, I'm not clear on how 
starting an impacted VM on a different compute host would help. Could you 
clarify this for me?"
    
    JG:  The VM would be started on another host that still has access to the 
storage.  An individual host can have problems and lose its connectivity to a 
primary storage device.  The solution we are working on would help to get the 
VM back up and running much faster than waiting for CloudStack to make a 
decision to restart the VM on a different host.
    
    Paul asked:
    
    "  1.  We can't/don't run scripts on vSphere hosts (not sure about Hyper-V)"
    
    JG:  I should have been clearer: this is for KVM hosts.
      
    "2.  I know of one failure scenario (which happened) where MTU issues in 
intermediate switches meant that small amounts of data could pass, but anything 
that was passed as jumbo frames then failed. So it would be important to 
exercise that."
    
    JG:  I have faced this Jumbo Frame issue as well.  Perhaps we need an 
option to indicate that Jumbo Frames are being used to access that storage, 
so that the test result would reflect a failure to access it using Jumbo 
Frames. 
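
A jumbo-frame path test of the kind discussed above could be sketched as follows. This is only an illustration, not part of the proposed plugin: the function names and the default MTU of 9000 are assumptions, and the flags are those of the Linux iputils `ping`. The payload is sized so that the whole IPv4 packet equals the MTU, and the Don't Fragment bit is set so that an undersized intermediate switch makes the test fail instead of silently fragmenting.

```python
import subprocess
from typing import List

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def jumbo_ping_command(storage_ip: str, mtu: int = 9000) -> List[str]:
    """Build a ping command whose packets fill the MTU and may not be fragmented."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    # -M do: set Don't Fragment; -s: payload size; -c: count; -W: reply timeout (s)
    return ["ping", "-M", "do", "-s", str(payload), "-c", "3", "-W", "2", storage_ip]


def jumbo_path_ok(storage_ip: str, mtu: int = 9000) -> bool:
    """True if full-size, unfragmented packets reach the storage address."""
    result = subprocess.run(
        jumbo_ping_command(storage_ip, mtu),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

With a 9000-byte MTU the payload works out to 8972 bytes. A small-packet test can succeed while this one fails, which is the scenario Paul describes.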
    
    "3.  You need to be very sure of failures before shutting hosts down.  Also 
a host is likely to be connected to multiple storage pools, so you wouldn't 
want to shut down a host due to one pool becoming unavailable."
    
    JG:  The script wouldn’t shut down any hosts at all.  Just force stop the 
affected VMs on that specific host and then start them on a host that is not 
having the issue with storage.
    
    "4.  Environments can have hundreds of storage pools, so watch out for 
spamming the logs with updates."
    
    JG:  The polling/testing time increments are configurable, so I am hoping 
that can help with that.  The results are pretty small and should be relatively 
negligible.
    
    "5.  The primary storage pools have a 'state' which should get updated and 
used by the deployment planners"
    
    JG:  I have copied Alex on this email to make sure he sees this suggestion. 
 We will figure out how to incorporate that 'state' field.
    
    "6.  Secondary storage pools don't have a 'state' - but it would be great 
if that were added in the DB and reflected in the UI."
    
    JG:  For now, I think this might be a feature request that we should 
submit through the normal CloudStack request process.  Otherwise, we can 
definitely include that in our work when we start to add it to the code 
base.
    
    To take this a step further, we are also working on a KVM host load 
balancer that will be used as a factor when moving the VMs.  We have a number 
of little projects we are working on.
    
    Thank you all for reviewing the information.  All suggestions are welcome.
    
    Jeromy Grimmett
    P: 603.766.3625
    [email protected]
    www.cloudbrix.com
    
    
    -----Original Message-----
    From: Paul Angus [mailto:[email protected]] 
    Sent: Saturday, March 11, 2017 2:43 AM
    To: [email protected]
    Subject: RE: [Proposal] - StorageHA
    
    Hi Jeromy,
    
    I love the idea, I'm not really a developer, so those guys will look at 
things a different way, but...
    
    These would be my initial comments:
    
    
      1.  We can't/don't run scripts on vSphere hosts (not sure about Hyper-V)
      2.  I know of one failure scenario (which happened) where MTU issues in 
intermediate switches meant that small amounts of data could pass, but anything 
that was passed as jumbo frames then failed. So it would be important to 
exercise that.
      3.  You need to be very sure of failures before shutting hosts down.  
Also a host is likely to be connected to multiple storage pools, so you 
wouldn't want to shut down a host due to one pool becoming unavailable.
      4.  Environments can have hundreds of storage pools, so watch out for 
spamming the logs with updates.
      5.  The primary storage pools have a 'state' which should get updated and 
used by the deployment planners
      6.  Secondary storage pools don't have a 'state' - but it would be great 
if that were added in the DB and reflected in the UI.
    
    
    
    Kind regards,
    
    Paul Angus
    
    
    [email protected]
    www.shapeblue.com
    53 Chandos Place, Covent Garden, London  WC2N 4HS UK @shapeblue
      
     
    
    From: Jeromy Grimmett [mailto:[email protected]]
    Sent: 10 March 2017 15:28
    To: [email protected]
    Subject: [Proposal] - StorageHA
    
    Hello,
    
    I am new to the mailing list, and we are glad to be a part of the 
CloudStack community.  We are looking to develop plugins and modules that will 
help grow and expand the adoption and use of CloudStack.  So as part of my 
introductory email, I'd like to introduce a little project we have been working 
on: a StorageHA Monitor.  The Monitor would allow CloudStack and the hosts to 
test, communicate and resolve VM availability issues when storage (primary 
and/or secondary) becomes unavailable.  This is a small write-up about how it 
would work:
    
    It consists of two scripts/programs:
    
    The host script runs on the host servers and checks whether the primary 
and secondary storage are available by doing a read/write test, then reports to 
the master script that runs on the CloudStack server. The host script will test 
a read and a write to the storage every 5 seconds (configurable), and if the 
test fails 3 times (configurable), the failure will be recorded by the master 
script.
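
The host-side check described above might look roughly like the sketch below. It is not the authors' code: the names `storage_rw_ok` and `FailureCounter` are hypothetical, and counting consecutive failures is an assumption, since the write-up only says "fails 3 times".

```python
import os
import uuid


def storage_rw_ok(mount_point: str) -> bool:
    """Write a small file under mount_point, read it back, then delete it."""
    payload = uuid.uuid4().hex.encode()
    path = os.path.join(mount_point, f".storageha-probe-{uuid.uuid4().hex}")
    try:
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # push the write to the storage, not just the cache
        with open(path, "rb") as fh:
            return fh.read() == payload
    except OSError:
        return False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass  # nothing to clean up if the write never succeeded


class FailureCounter:
    """Tracks consecutive failures (assumed) against a configurable threshold."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive = 0

    def record(self, ok: bool) -> bool:
        """Return True when the failure should be reported to the master script."""
        self.consecutive = 0 if ok else self.consecutive + 1
        return self.consecutive >= self.threshold
```

A polling loop would call `storage_rw_ok` every 5 seconds (or whatever interval is configured) and pass the result to `record`. Two caveats: the read-back may be served from the page cache, and a hung NFS mount can block instead of raising an error, so a real probe would also need a timeout around the test.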
    
    The master script will monitor the results of the host script. If the test 
is good, nothing happens and the results are logged so that we can track the 
history of the test results. If the test reports back as failed, then it will 
perform the following actions:
    
    
      *   Secondary Storage - It will simply generate and send an alert that 
the failure has occurred.
    
    
      *   Primary Storage - The script will perform the following tasks:
         *   Generate and send an alert that the failure has occurred.
         *   Force the VMs on that host to shut down.
         *   Determine which host to move the VMs to.
         *   Start the VMs on the healthy host.
    
    We have already started working on some code, and the solution seems to be 
testing well.  Any thoughts, ideas or input are welcome.  Should there be a 
solution out there already, then please forgive our ignorance, and point us in 
the right direction. We look forward to further collaboration with you all.
    
    Regards,
    j
    
    Jeromy Grimmett
    155 Fleet Street
    Portsmouth, NH 03801
    Direct: 603.766.3625
    Office: 603.766.4908
    Fax: 603.766.4729
    [email protected]
    www.cloudbrix.com
