Koushik Das created CLOUDSTACK-8666:
---------------------------------------
Summary: Put host in Alert state only after alert.wait timeout
Key: CLOUDSTACK-8666
URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8666
Project: CloudStack
Issue Type: Bug
Security Level: Public (Anyone can view this level - this is the default.)
Components: Management Server
Affects Versions: 4.5.0, 4.6.0
Reporter: Koushik Das
Assignee: Koushik Das
Fix For: 4.6.0
When there is a ping timeout on a host, investigators try to determine the
state of the host. If none of the investigators is able to determine the host
state, the process is repeated after some time. This works most of the time,
except in some boundary scenarios. For example, if the last host or all hosts
in an XS cluster are brought down, the investigators are not able to determine
the host state and the investigation process never completes. In such
scenarios the host state always remains Up.
To fix these boundary scenarios, a change was made (refer to commit
4a13f81485c0f0664c60acafe534946e7428f080) to immediately put the host in Alert
state if the investigators are not able to determine the state after a ping
timeout. That change solved the boundary scenarios but introduced a new issue.
Suppose there is an XS cluster with 2 hosts and the master host is brought
down. In this case XS elects a new master for the cluster. Since the master is
down, the investigators won't be able to determine the host state until a new
master is elected. If this master election takes longer than the ping timeout
to complete, the host is put in Alert state by the above change. Once this
happens, the host remains in Alert state and no actions are taken on the VMs
on this host. Had the investigators been allowed to run 1 or 2 more times, the
new master election would possibly have completed and the host state would
have been determined correctly.
To fix both these issues, instead of putting the host in Alert state
immediately, the investigators should be allowed to keep running for an
interval set by the alert.wait global config. If the host state still cannot
be determined at the end of this interval, the host is put in Alert state.
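
The proposed behaviour can be sketched roughly as below. This is only an
illustration of the decision rule, not the actual CloudStack code: the class
AlertWaitSketch, the Status enum, the nextStatus method and the 1800 second
value used for alert.wait are all assumptions made up for this example.

```java
import java.time.Duration;

// Hypothetical sketch of the proposed rule; none of these names are
// taken from the CloudStack code base.
class AlertWaitSketch {
    enum Status { UP, DOWN, ALERT }

    /**
     * Decide the host status after one round of investigation.
     *
     * @param current      status the host currently has
     * @param investigated status reported by the investigators,
     *                     or null if none of them could determine it
     * @param elapsed      time since the ping timeout was detected
     * @param alertWait    value of the alert.wait global config
     */
    static Status nextStatus(Status current, Status investigated,
                             Duration elapsed, Duration alertWait) {
        if (investigated != null) {
            // An investigator gave a definite answer, so use it.
            return investigated;
        }
        if (elapsed.compareTo(alertWait) >= 0) {
            // Still undetermined after alert.wait: give up and alert.
            return Status.ALERT;
        }
        // Undetermined but within alert.wait: keep the current status
        // so that the investigators get another chance to run.
        return current;
    }

    public static void main(String[] args) {
        // Value chosen only for illustration.
        Duration alertWait = Duration.ofSeconds(1800);

        // Master election still in progress, well within alert.wait.
        System.out.println(nextStatus(Status.UP, null,
                Duration.ofSeconds(120), alertWait));   // UP

        // New master elected, investigators now report the host as down.
        System.out.println(nextStatus(Status.UP, Status.DOWN,
                Duration.ofSeconds(300), alertWait));   // DOWN

        // All hosts in the cluster are down, state never determined.
        System.out.println(nextStatus(Status.UP, null,
                Duration.ofSeconds(1800), alertWait));  // ALERT
    }
}
```

The first two calls correspond to the master election scenario, where waiting
lets the investigators eventually return a definite state. The third call
corresponds to the boundary scenario where the state can never be determined,
so the host still ends up in Alert state once alert.wait has elapsed.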
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)