[
https://issues.apache.org/jira/browse/CLOUDSTACK-10246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382718#comment-16382718
]
ASF GitHub Bot commented on CLOUDSTACK-10246:
---------------------------------------------
Slair1 opened a new pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA
issues
URL: https://github.com/apache/cloudstack/pull/2474
The HA logic just does not work. VMs with HA enabled would never restart
after a host failure. We had to redo most of that logic. There are comments
inline with the code, but here is the general updated logic. Sorry for the
long notes...
We are running KVM FYI.
- If host-agent is unreachable, handleDisconnectWithInvestigation() is
called as always.
- The investigators are called to see what happened, which is one of the
following two scenarios. (If it isn't one of the two below, then the host just
came back UP, or another status was returned and that is also logged. But the
two scenarios below are what needed updating the most.)
**If the investigators find the host is UP, but just the agent is
unreachable**
The host is put into DISCONNECTED status. It will stay in this status and
the PingTimeouts will continue to call handleDisconnectWithoutInvestigation()
periodically. It will stay in DISCONNECTED status until the AlertWait config
option expires. If the AlertWait time is eventually hit and the investigators
are still reporting that the host is DISCONNECTED and not DOWN, then we'll put
the host into ALERT state, and we'll stay there until the investigators say
the host is either UP or DOWN. If the host goes DOWN, then VM HA will be
initiated.
**If the investigators find the host is DOWN**
Then VM HA is initiated...
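The two scenarios above can be read as a small decision function. The sketch
below is only an illustration of that reading, not the code in this PR: the
class, enum, and method names (HostHaSketch, Finding, nextStatus,
shouldStartVmHa) are hypothetical, and the 30-minute AlertWait value is made up
for the example.

```java
import java.time.Duration;

// Hypothetical sketch of the described logic; none of these names are CloudStack's.
final class HostHaSketch {
    // What the investigators report (assumed simplification of their results).
    enum Finding { UP, AGENT_UNREACHABLE, DOWN }

    // Host statuses mentioned in the description above.
    enum HostStatus { UP, DISCONNECTED, ALERT, DOWN }

    // Decide the host status from the latest finding and the time since the disconnect.
    static HostStatus nextStatus(Finding finding, Duration sinceDisconnect, Duration alertWait) {
        if (finding == Finding.UP) {
            return HostStatus.UP;
        }
        if (finding == Finding.DOWN) {
            return HostStatus.DOWN;
        }
        // Host is up but the agent is unreachable: wait in DISCONNECTED until
        // AlertWait expires, then move to ALERT.
        boolean alertWaitExpired = sinceDisconnect.compareTo(alertWait) >= 0;
        return alertWaitExpired ? HostStatus.ALERT : HostStatus.DISCONNECTED;
    }

    // VM HA is only initiated once the host is considered DOWN.
    static boolean shouldStartVmHa(HostStatus status) {
        return status == HostStatus.DOWN;
    }

    public static void main(String[] args) {
        Duration alertWait = Duration.ofMinutes(30); // made-up value for illustration
        HostStatus early = nextStatus(Finding.AGENT_UNREACHABLE, Duration.ofMinutes(5), alertWait);
        HostStatus late = nextStatus(Finding.AGENT_UNREACHABLE, Duration.ofMinutes(45), alertWait);
        HostStatus down = nextStatus(Finding.DOWN, Duration.ofMinutes(45), alertWait);
        System.out.println(early + " startVmHa=" + shouldStartVmHa(early));
        System.out.println(late + " startVmHa=" + shouldStartVmHa(late));
        System.out.println(down + " startVmHa=" + shouldStartVmHa(down));
    }
}
```

In this reading, neither DISCONNECTED nor ALERT triggers VM HA on its own; only
a DOWN result from the investigators does.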
**VirtualNetworkApplianceManagerImpl.java**
The file VirtualNetworkApplianceManagerImpl.java is edited for a related VM
HA problem. When a host is determined to be DOWN, CloudStack attempts to VM HA
any affected routers. The problem is that when the host is determined to be
DOWN by the code referenced above, the host may not actually be down. On KVM,
for example, the host is considered DOWN if the agent is stopped on the KVM
host for too long. In that case, the VMs could still be running just fine...
However, when we think the host is DOWN, VM HA runs on the router, and as part
of that it unallocates/cleans up the router and its 169.x.x.x control IP is
unallocated. After the cleanup, it tries to power on the router on another
host, and as part of that it allocates a NEW 169.x.x.x control IP and writes
that to the DB. However, since the router isn't actually down (we just think
the host is down), the VM HA fails because the vRouter is still running on the
problem host.
Next, in this example, when the host agent is back online again, it sends a
power report to the management servers, and the management servers think the
router was powered on out-of-band (OOB). However, the GUI will not show a
control IP for the vRouter, and the DB will have the NEW control IP it tried
to allocate during the failed VM HA event. This leaves us unable to
communicate with the vRouter.
This PR does a simple check that we can still communicate with the vRouter
after any OOB power-on occurs. If we can, then we have the correct control IP
in the DB and we're good - so we do nothing. If we can't communicate with the
vRouter after the OOB power-on, we do a reboot of the vRouter to fix it.
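The check described above boils down to "probe the control IP stored in the DB,
and reboot only if the probe fails". A minimal sketch of that decision follows.
It is not the code in VirtualNetworkApplianceManagerImpl.java: the class name,
the Action enum, the reachability probe passed in as a Predicate, and the
example IP address are all hypothetical. Treating a missing control IP as
unreachable is likewise an assumption made for the sketch.

```java
import java.util.function.Predicate;

// Hypothetical sketch of the post-OOB-power-on check; not CloudStack's actual API.
final class RouterOobCheckSketch {
    enum Action { NONE, REBOOT }

    // controlIpInDb: the control IP recorded in the DB (may be null).
    // canReach: a probe that returns true if the vRouter answers on that IP.
    static Action afterOutOfBandPowerOn(String controlIpInDb, Predicate<String> canReach) {
        // Assumption: no recorded control IP means we cannot talk to the router.
        if (controlIpInDb != null && canReach.test(controlIpInDb)) {
            // The DB holds the IP the running router actually uses, so do nothing.
            return Action.NONE;
        }
        // The DB holds a stale or new IP the running router does not have:
        // reboot the router so it comes up with the IP recorded in the DB.
        return Action.REBOOT;
    }

    public static void main(String[] args) {
        // Made-up address standing in for the IP the running router really has.
        String actualIp = "169.254.0.10";
        Predicate<String> probe = ip -> ip.equals(actualIp);
        System.out.println(afterOutOfBandPowerOn("169.254.0.10", probe));
        System.out.println(afterOutOfBandPowerOn("169.254.0.99", probe));
    }
}
```

Passing the probe in as a parameter keeps the decision separate from however
the management server actually reaches the router.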
> VM HA issues
> ------------
>
> Key: CLOUDSTACK-10246
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10246
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.11.0.0
> Environment: My setup is CentOS 7 Management server with 3 CentOS 7
> KVM HVs, NFS as primary and secondary storages.
> Reporter: Nux
> Priority: Major
>
> VM HA fails to kick in when one of the hypervisors goes down.
> It even fails to restart the system VMs which remain down along with the
> instances until the affected HV comes back online.
> When I crash or power off the HV the system marks it in the hosts list as
> "Alert" or "Disconnected" respectively. It should get changed to "Down" after
> that, but this never happens.
>
> I have tried various combinations of setups (Adv, Basic), none succeeded.
>
> My instances use HA enabled offerings.
> Management server DEBUG logs here:
> [http://tmp.nux.ro/CW4-vmhafail-411rc1.txt]