[
https://issues.apache.org/jira/browse/CLOUDSTACK-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koushik Das resolved CLOUDSTACK-8666.
-------------------------------------
Resolution: Fixed
> Put host in Alert state only after alert.wait timeout
> -----------------------------------------------------
>
> Key: CLOUDSTACK-8666
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8666
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public (Anyone can view this level - this is the default.)
> Components: Management Server
> Affects Versions: 4.5.0, 4.6.0
> Reporter: Koushik Das
> Assignee: Koushik Das
> Fix For: 4.6.0
>
>
> When there is a ping timeout on a host, investigators try to determine the
> state of the host. If none of the investigators can determine the host
> state, the process is repeated after some time. This works most of the
> time, except in some boundary scenarios. For example, if the last host or
> all hosts in an XS cluster are brought down, the investigators cannot
> determine the host state and the investigation process never completes. In
> such scenarios the host state always remains Up.
> To handle these boundary scenarios, a fix was made (refer to commit
> 4a13f81485c0f0664c60acafe534946e7428f080) to immediately put the host in
> the Alert state if the investigators cannot determine the state after a
> ping timeout.
> The fix solved the boundary scenarios but introduced a new issue. Suppose
> there is an XS cluster with 2 hosts and the master host is brought down. In
> this case XS elects a new master for the cluster. Since the master is down,
> investigators won't be able to determine the host state until a new master
> is elected. If the master election takes longer than the ping timeout to
> complete, the host is put in Alert based on the above fix. Once this
> happens, the host remains in the Alert state and no actions are taken on the
> VMs on this host. If the investigators had been allowed to run 1 or 2 more
> times, the new master election might have completed and the host state been
> correctly determined.
> To fix both of these issues, instead of putting the host in the Alert state
> immediately, the investigators should be allowed to run for some time based
> on the alert.wait global config. If the host state still cannot be
> determined at the end of this interval, then put the host in Alert.
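
The decision rule described above could be sketched roughly as follows. This
is only an illustration of the proposed behaviour, not the code from the
actual fix: the class, enum, and method names are hypothetical, a null
argument stands in for "no investigator could determine the state", and
alert.wait is assumed here to be expressed in seconds.

```java
// Hypothetical sketch; none of these names are taken from the CloudStack code base.
final class AlertWaitSketch {

    enum HostState { UP, DOWN, ALERT }

    /**
     * Decides the host state after one round of investigation.
     *
     * @param investigated            state reported by the investigators, or null
     *                                if none of them could determine it
     * @param secondsSincePingTimeout time elapsed since the ping timeout was detected
     * @param alertWaitSeconds        value of the alert.wait global config
     *                                (assumed to be in seconds)
     * @param current                 state the host is currently recorded in
     */
    static HostState nextState(HostState investigated,
                               long secondsSincePingTimeout,
                               long alertWaitSeconds,
                               HostState current) {
        if (investigated != null) {
            // An investigator reached a verdict; use it.
            return investigated;
        }
        if (secondsSincePingTimeout >= alertWaitSeconds) {
            // Still undetermined after the alert.wait interval: give up and alert.
            return HostState.ALERT;
        }
        // Undetermined, but still inside the interval: keep the current state
        // so the investigators can run again later.
        return current;
    }

    public static void main(String[] args) {
        long alertWait = 1800; // illustrative value only

        // Master election still in progress shortly after the ping timeout.
        System.out.println(nextState(null, 120, alertWait, HostState.UP));
        // New master elected; investigators now report the host as down.
        System.out.println(nextState(HostState.DOWN, 300, alertWait, HostState.UP));
        // Whole cluster down; state never determined within alert.wait.
        System.out.println(nextState(null, 1800, alertWait, HostState.UP));
    }
}
```

The three calls in main correspond to the cases in the description: a master
election that has not yet completed, an election that completes before the
interval expires, and the boundary scenario where the whole cluster is down.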