ggoodrich-ipp opened a new pull request #3915: Incorporate VR OOB start checks 
to properly HA the VR
URL: https://github.com/apache/cloudstack/pull/3915
 
 
   ## Description
   This PR edits VirtualNetworkApplianceManagerImpl.java to address a VM HA
problem with virtual routers. When a host is determined to be DOWN, CloudStack
attempts to VM HA any affected routers. The problem is that a host determined
to be DOWN may not actually be down. On KVM, for example, the host is
considered DOWN if the agent on the KVM host is stopped for too long, in which
case the VMs could still be running just fine. However, once CloudStack thinks
the host is DOWN, VM HA runs on the router and, as part of that, cleans up the
router and unallocates its 169.x.x.x control IP. After the cleanup, it tries to
power on the router on another host, and as part of that it allocates a NEW
169.x.x.x control IP and writes it to the DB. Since the router isn't actually
down (we just think the host is down), the VM HA fails because the vRouter is
still running on the problem host.
   
   Next, in this example, when the host agent comes back online, it sends a
power report to the management servers, which conclude that the router was
powered on out-of-band (OOB). However, the GUI will not show a control IP for
the vRouter, and the DB will hold the NEW control IP it tried to allocate
during the failed VM HA event, leaving us unable to communicate with the
vRouter.
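   
   The sequence above can be reduced to a toy model. This is only a sketch of
the failure mode, not CloudStack code: the class, the method, and the two
169.254.x.x addresses are invented for illustration.

```java
/** Toy model of the failure mode; every name here is hypothetical. */
final class StaleControlIpDemo {

    /** Stand-in for the router's row in the database. */
    static final class RouterRecord {
        String controlIpInDb;

        RouterRecord(String controlIpInDb) {
            this.controlIpInDb = controlIpInDb;
        }
    }

    /**
     * Mimics one VM HA attempt: clean up, record a new control IP, then try
     * to start the router elsewhere. Returns true only if the start succeeds.
     */
    static boolean attemptHa(RouterRecord record, boolean stillRunningOnOldHost, String newControlIp) {
        record.controlIpInDb = null;          // cleanup releases the old control IP
        record.controlIpInDb = newControlIp;  // start preparation records a new one
        return !stillRunningOnOldHost;        // the start fails if the router never stopped
    }

    public static void main(String[] args) {
        String ipOnRunningRouter = "169.254.1.10"; // made-up example address
        RouterRecord record = new RouterRecord(ipOnRunningRouter);

        boolean started = attemptHa(record, true, "169.254.2.20");

        System.out.println("HA start succeeded: " + started);
        System.out.println("IP the router is using: " + ipOnRunningRouter);
        System.out.println("IP recorded in DB: " + record.controlIpInDb);
    }
}
```

   In this model the failed attempt leaves the DB pointing at an address the
running router never received, which is the mismatch described above.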
   
   This PR does a simple check that we can still communicate with the vRouter 
after any OOB power-on occurs. If we can, then we have the correct control IP 
in the DB and we're good - so we do nothing. If we can't communicate with the 
vRouter after the OOB power-on, we do a reboot of the vRouter to fix it.
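   
   As a rough illustration of that decision (not the code in this PR), the
check could be sketched as below. The RouterProbe and RouterRebooter
interfaces, the method name, and the example addresses are all assumptions
made for this sketch; the real change lives in
VirtualNetworkApplianceManagerImpl.java and uses CloudStack's own types.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the post-OOB-start check; all names are hypothetical. */
final class OobStartCheck {

    /** Hypothetical stand-in for "can the management server reach the router?". */
    interface RouterProbe {
        boolean canReach(String controlIp);
    }

    /** Hypothetical stand-in for the router reboot operation. */
    interface RouterRebooter {
        void reboot(long routerId);
    }

    enum Outcome { HEALTHY, REBOOTED }

    /** Called after a router is reported as powered on out-of-band. */
    static Outcome onOutOfBandStart(long routerId, String controlIpInDb,
                                    RouterProbe probe, RouterRebooter rebooter) {
        // Reachable via the IP in the DB: the record is correct, do nothing.
        if (controlIpInDb != null && probe.canReach(controlIpInDb)) {
            return Outcome.HEALTHY;
        }
        // Unreachable: reboot so the router comes back with the IP in the DB.
        rebooter.reboot(routerId);
        return Outcome.REBOOTED;
    }

    public static void main(String[] args) {
        List<Long> rebooted = new ArrayList<>();
        // Pretend the running router only answers on its original address.
        RouterProbe onlyOriginalIpAnswers = ip -> ip.equals("169.254.1.10");

        System.out.println(onOutOfBandStart(157L, "169.254.1.10",
                onlyOriginalIpAnswers, id -> rebooted.add(id)));
        System.out.println(onOutOfBandStart(157L, "169.254.2.20",
                onlyOriginalIpAnswers, id -> rebooted.add(id)));
        System.out.println("Rebooted router ids: " + rebooted);
    }
}
```

   Passing the probe and the reboot action in as interfaces keeps the sketch
testable without a real router; how the actual PR reaches the router is not
shown here.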
   
   ## Types of changes
   - [ ] Breaking change (fix or feature that would cause existing 
functionality to change)
   - [ ] New feature (non-breaking change which adds functionality)
   - [x] Bug fix (non-breaking change which fixes an issue)
   - [x] Enhancement (improves an existing feature and functionality)
   - [ ] Cleanup (Code refactoring and cleanup, that may add test cases)
   
   ## Screenshots (if appropriate):
   
   ## How Has This Been Tested?
   I ran this SQL statement to simulate an OOB power-on by making CloudStack
believe the router is down while the host then sends a power report stating it
is running:
   
   -- id = 157 is the row id of the virtual router in the vm_instance table
   `update vm_instance set state='Stopped', power_state='PowerReportMissing', host_id=NULL where id=157;`
   
   I then observed that the router was marked as OOB started, was considered
healthy, and no further action was taken.
   
   I then ran the SQL statement above again to make CloudStack believe the
router is down, connected to the router via cloudstack-ssh, and took it to run
level 1 via 'init 1' so that it can no longer be connected to.
   
   I then observed that the router was restarted by CloudStack, and verified
this in the logs on the management server.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
