[
https://issues.apache.org/jira/browse/CLOUDSTACK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638730#comment-14638730
]
ASF GitHub Bot commented on CLOUDSTACK-8666:
--------------------------------------------
GitHub user koushik-das opened a pull request:
https://github.com/apache/cloudstack/pull/621
CLOUDSTACK-8666: Put host in Alert state only after alert.wait timeout
Instead of putting the host in the Alert state immediately, the investigators
should be allowed to run for some time based on the alert.wait global config.
If the host state still cannot be determined at the end of this interval,
put the host in Alert. Some of the log messages were also updated.
Refer to the bug for the detailed description.
Since these scenarios are difficult to simulate, I haven't written any tests.
If anyone has suggestions for tests, please let me know.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/koushik-das/cloudstack CLOUDSTACK-8666
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/cloudstack/pull/621.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #621
----
commit 2cd8e6726540186b9ccec41beebe21af3a6d08b6
Author: Koushik Das <[email protected]>
Date: 2015-07-23T12:27:51Z
CLOUDSTACK-8666: Put host in Alert state only after alert.wait timeout
Instead of putting the host in the Alert state immediately, the investigators
should be allowed to run for some time based on the alert.wait global config.
If the host state still cannot be determined at the end of this interval,
put the host in Alert. Some of the log messages were also updated.
----
> Put host in Alert state only after alert.wait timeout
> -----------------------------------------------------
>
> Key: CLOUDSTACK-8666
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8666
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.5.0, 4.6.0
> Reporter: Koushik Das
> Assignee: Koushik Das
> Fix For: 4.6.0
>
>
> When there is a ping timeout on a host, investigators try to determine the
> state of that host. If none of the investigators can determine the host
> state, the process is repeated after some time. This works most of the
> time, except in some boundary scenarios. For example, if the last host or all
> hosts in an XS cluster are brought down, the investigators cannot determine
> the host state and the investigation process never completes. In such
> scenarios the host state always remains Up.
> To handle these boundary scenarios, a fix was made (refer to commit
> 4a13f81485c0f0664c60acafe534946e7428f080) to immediately put the host in
> the Alert state if the investigators cannot determine the state after a ping
> timeout.
> That fix resolved the boundary scenarios but introduced a new issue. Suppose
> there is an XS cluster with 2 hosts and the master host is brought down. In
> this case XS elects a new master for the cluster. Since the master is down,
> investigators won't be able to determine the host state until a new master is
> elected. If this master election takes longer than the ping timeout to
> complete, the host is put in Alert because of the above fix. Once this
> happens, the host remains in the Alert state and no actions are taken on the
> VMs on this host. Had the investigators been allowed to run 1 or 2 more
> times, the new master election would possibly have completed and the host
> state would have been determined correctly.
> To fix both of these issues, instead of putting the host in the Alert state
> immediately, the investigators should be allowed to run for some time based
> on the alert.wait global config. If the host state still cannot be determined
> at the end of this interval, put the host in Alert.
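
The decision rule described above can be sketched roughly as follows. This is
only an illustration, not the code from the pull request: the class
AlertWaitSketch, the HostState enum, the nextState method and its parameters
are all hypothetical names, and measuring the interval from the start of the
investigation is an assumption, since the description does not say exactly
where the alert.wait interval begins.

```java
// Hypothetical sketch of the alert.wait decision rule; none of these names
// come from the CloudStack code base.
class AlertWaitSketch {

    // UNDETERMINED means the investigators could not decide (yet).
    enum HostState { UP, DOWN, ALERT, UNDETERMINED }

    /**
     * Decide what to do after one round of investigation.
     *
     * @param investigated     state reported by the investigators,
     *                         or UNDETERMINED if none could decide
     * @param elapsedSeconds   seconds since the investigation started
     *                         (assumed reference point)
     * @param alertWaitSeconds value of the alert.wait global config
     * @return the state to apply; UNDETERMINED means "run the
     *         investigators again later"
     */
    static HostState nextState(HostState investigated,
                               long elapsedSeconds,
                               long alertWaitSeconds) {
        // A definite answer from the investigators always wins.
        if (investigated != HostState.UNDETERMINED) {
            return investigated;
        }
        // Still unknown after the full alert.wait interval: give up
        // and put the host in Alert.
        if (elapsedSeconds >= alertWaitSeconds) {
            return HostState.ALERT;
        }
        // Still unknown, but inside the interval: keep investigating.
        return HostState.UNDETERMINED;
    }
}
```

With this rule, the 2-host XS cluster case stays in the "keep investigating"
branch while the new master is being elected, and the "all hosts down" case
still ends in Alert once the interval has elapsed.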
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)