[ https://issues.apache.org/jira/browse/CLOUDSTACK-9350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15242823#comment-15242823 ]
ASF GitHub Bot commented on CLOUDSTACK-9350:
--------------------------------------------
GitHub user abhinandanprateek opened a pull request:
https://github.com/apache/cloudstack/pull/1496
CLOUDSTACK-9350: KVM-HA- Fix CheckOnHost for Local storage
- KVM-HA- Fix CheckOnHost for Local storage
- Also skip HA on VMs that are using local storage
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/shapeblue/cloudstack kvm-ha
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/cloudstack/pull/1496.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1496
----
commit 5144820eb30190a4c0a32d63f56396c279aa44d8
Author: Abhinandan Prateek <[email protected]>
Date: 2016-04-15T11:16:30Z
CLOUDSTACK-9350: KVM-HA- Fix CheckOnHost for Local storage
- Also skip HA on VMs that are using local storage
----
> Local storage hosts get HA tasks, cause issues
> -----------------------------------------------
>
> Key: CLOUDSTACK-9350
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-9350
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Affects Versions: 4.5.1
> Reporter: Abhinandan Prateek
> Assignee: Abhinandan Prateek
>
> When a host hits its ping timeout, for whatever reason, the investigators
> are triggered. The KVMInvestigator sends a CheckOnHostCommand to the target
> host, and then to all the remaining neighbor hosts in the cluster. The
> CheckOnHostCommand (and also FenceCommand, the code is nearly identical) is
> processed by the KVM agent and simply scans through all NFS primary storage
> looking for the host's heartbeat in the KVMHA directory. If no heartbeat file
> is found, it fails the check. In the case of clusters that are local-only,
> these hosts will always fail the check, whether it be the target host or a
> neighbor checking on the target. This triggers a host 'down' event, which
> triggers HA tasks. The HA tasks will attempt to stop any VMs on the host, and
> then if the VM's offering is HA-enabled it will try to restart the VM.
>
> Our recent issue was that a management server took extraordinarily long to
> rotate its logs and was slow to process some host pings. The
> CheckOnHostCommand was sent to a suspect host, which failed because it had no
> primary NFS. The neighbor checks also failed to check the suspect host's
> heartbeat for the same reason. Then the host was marked as down and all VMs
> were stopped. Multiply this by a few dozen hosts.
>
> The immediate fix, provided in the example, is a patch to KVMInvestigator
> which will only attempt investigation if the host's cluster has NFS storage,
> which is a requirement for the host to run the check, as described above. If
> there is none, the host state is determined to be disconnected rather than
> down. This means that the host will still end up in alert state and need
> manual investigation, but there will be no attempt to stop or HA the VMs.
>
> Additionally, the patch catches scenarios where a cluster might have both NFS
> and local storage and a host ends up in 'down' state. In this case, when the
> HA tasks are being created, if a VM is using local storage then the HA task
> generation is skipped. This VM can't be started anywhere else.
>
> We could also make the agent side more robust: in KVMHAChecker we may not
> want it to return 'false' if there were zero pools passed to check for HA
> heartbeat. Then again, maybe we do. We decided initially to patch just the
> server side, because it is easier to deploy.
>
> In the long run, I'd hope that the current HA work would supersede the
> current KVMInvestigator and take the cluster's ability to pass any defined
> checks into account before checking.
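
The server-side behaviour described above can be sketched roughly as follows. This is an illustrative Python model, not the actual CloudStack code (which is Java); the names Status, Cluster, Vm, heartbeat_found, investigate and vms_for_ha_tasks, and the "NFS"/"LOCAL" labels, are all hypothetical and chosen only to mirror the description. It also collapses the target-host and neighbor checks into a single heartbeat lookup.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    UP = "Up"
    DOWN = "Down"
    DISCONNECTED = "Disconnected"


@dataclass
class Cluster:
    # Primary storage types present in the cluster (hypothetical labels).
    pool_types: List[str] = field(default_factory=list)


@dataclass
class Vm:
    name: str
    uses_local_storage: bool


def heartbeat_found(nfs_heartbeats: List[bool]) -> bool:
    """Agent-side check as described: one entry per NFS pool scanned.

    With zero NFS pools there is nothing to scan, so the check can only fail.
    """
    return any(nfs_heartbeats)


def investigate(cluster: Cluster, nfs_heartbeats: List[bool]) -> Status:
    """Patched investigator: only run the heartbeat check if NFS exists."""
    if "NFS" not in cluster.pool_types:
        # No NFS primary storage: the check is meaningless, so report
        # Disconnected (alert, manual investigation) instead of Down.
        return Status.DISCONNECTED
    return Status.UP if heartbeat_found(nfs_heartbeats) else Status.DOWN


def vms_for_ha_tasks(vms: List[Vm]) -> List[Vm]:
    """Skip HA task generation for VMs on local storage.

    Such a VM cannot be started on any other host.
    """
    return [vm for vm in vms if not vm.uses_local_storage]
```

In this model a local-only cluster always yields Disconnected, so no VMs are stopped, while a mixed cluster can still reach Down but only generates HA tasks for VMs that are not on local storage.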
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)