[
https://issues.apache.org/jira/browse/CLOUDSTACK-7415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108874#comment-14108874
]
ASF subversion and git services commented on CLOUDSTACK-7415:
-------------------------------------------------------------
Commit 8ce6eba549bcd3fa007aaf10a29c3a2fef9ffaaa in cloudstack's branch
refs/heads/master from [~likithas]
[ https://git-wip-us.apache.org/repos/asf?p=cloudstack.git;h=8ce6eba ]
CLOUDSTACK-7415. Host remains in Alert after vCenter restart.
Management server PingTask should update PingMap entry for an agent only if it
is already present in the Management Server's PingMap.
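
The guarded update described in this commit message can be sketched as follows. This is a minimal illustration, not the code from the commit: the class PingMapSketch and its methods onConnect, onDisconnect, recordPing and isTracked are hypothetical names, and the real PingMap in the Management Server may be structured differently.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the Management Server's PingMap; not CloudStack's actual class. */
final class PingMapSketch {
    // agent (host) id -> time of the last successful ping, in seconds
    private final ConcurrentMap<Long, Long> pingMap = new ConcurrentHashMap<>();

    /** Called when an agent connects: the only place an entry is created. */
    void onConnect(long agentId, long nowSeconds) {
        pingMap.put(agentId, nowSeconds);
    }

    /** Called when a disconnect is processed: the agent is no longer tracked. */
    void onDisconnect(long agentId) {
        pingMap.remove(agentId);
    }

    /**
     * Called by a ping task. Refreshes the timestamp only if the agent is
     * still tracked, and returns false if the ping was ignored.
     */
    boolean recordPing(long agentId, long nowSeconds) {
        // replace() is atomic and does nothing when the key is absent,
        // so a late ping cannot recreate an entry removed by a disconnect.
        return pingMap.replace(agentId, nowSeconds) != null;
    }

    boolean isTracked(long agentId) {
        return pingMap.containsKey(agentId);
    }
}
```

With this shape, a ping that arrives after onDisconnect is dropped instead of recreating the entry, which is the behaviour the commit message asks for.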
> Host remains in Alert after vCenter restart
> -------------------------------------------
>
> Key: CLOUDSTACK-7415
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-7415
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the
> default.)
> Components: Management Server
> Affects Versions: 4.0.0
> Reporter: Likitha Shetty
> Assignee: Likitha Shetty
> Priority: Critical
> Fix For: 4.5.0
>
>
> In a clustered management server environment, after a vCenter restart some
> hosts repeatedly go back into alert state even after the vCenter comes up.
> The root cause is the following race condition:
> There is a scheduled PingTask that is run for every host and the interval at
> which it is run is configurable (global config - ping.interval). When vCenter
> gets restarted, PingTask is unable to get the host status and so it schedules
> another task to handle the disconnect for the host agent.
> This disconnect task determines the host status by sending CheckHealthCommand
> to the agent. When the command returns an answer that says the resource is
> not alive, CS investigates further, and in this case the VMware investigator
> confirms that the host is disconnected. The disconnect is then processed,
> which involves the following:
> 1. Cancel all scheduled tasks for that agent, including PingTask
> 2. Send disconnect to all listeners including AgentMonitor, which clears the
> agent from MS's PingMap
> If the above disconnect takes a while to get scheduled and spills over into
> the next PingTask interval, the next PingTask still runs. If by then vCenter
> is up and the host is connected, the ping succeeds and an entry for the
> agent is made in the PingMap.
> Once an entry is made in the PingMap after a disconnect, the AgentMonitor
> task, which runs every minute, finds the agent behind on ping. Because the
> attache is no longer connected, it disconnects the host agent without
> investigation and puts the host back into Alert state.
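
The sequence in the description can be replayed deterministically to show why a recreated entry is later flagged. This is only a sketch under stated assumptions: BehindOnPingSketch, findBehindOnPing and replayRace are hypothetical names, and the timestamps and the 150-second timeout are made-up values, not CloudStack defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical replay of the race; not CloudStack's AgentMonitor. */
final class BehindOnPingSketch {
    /** Returns the ids of agents whose last ping is older than timeoutSeconds. */
    static List<Long> findBehindOnPing(Map<Long, Long> pingMap, long nowSeconds, long timeoutSeconds) {
        List<Long> behind = new ArrayList<>();
        for (Map.Entry<Long, Long> e : pingMap.entrySet()) {
            if (nowSeconds - e.getValue() > timeoutSeconds) {
                behind.add(e.getKey());
            }
        }
        return behind;
    }

    /** Replays the reported sequence using an unconditional put for the late ping. */
    static List<Long> replayRace() {
        Map<Long, Long> pingMap = new TreeMap<>();
        pingMap.put(7L, 0L);   // t=0: host 7 connects (illustrative id and time)
        pingMap.remove(7L);    // disconnect processed: entry cleared, ping task cancelled
        pingMap.put(7L, 60L);  // t=60: a late ping succeeds and recreates the entry
        // No further pings arrive because the ping task was cancelled,
        // so the entry only grows older until the monitor notices it.
        return findBehindOnPing(pingMap, 240L, 150L); // assumed 150 s timeout
    }
}
```

In this replay the monitor scan at t=240 reports host 7 as behind on ping, even though the host itself is healthy.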